Assume one observes independent categorical variables or, equivalently, one observes the
corresponding multinomial variables. Estimating the distribution of the observed sequence amounts
to estimating the expectation of the multinomial sequence. A new estimator for this mean is
proposed that is nonparametric, non-asymptotic and implementable even for large sequences.
It is a penalized least-squares estimator based on wavelets, with a penalization term inspired
by papers of Birgé and Massart. The estimator is proved to satisfy an oracle inequality and to
be adaptive in the minimax sense over a class of Besov bodies. The method is embedded in a
general framework which also allows us to recover an existing method for segmentation. Beyond
theoretical results, a simulation study is reported and an application on real data is provided.

1. Introduction

Assume that we observe independent categorical variables $Y_1, \dots, Y_n$ which take values in
a finite alphabet coded $\{1, \dots, r\}$ with some integer $r \ge 2$. Such observations emerge, for
instance, in DNA studies when the independence assumption is postulated. Equivalently,
we observe the corresponding independent multinomial variables $X_1, \dots, X_n$ where
$X_i = (1, 0, \dots, 0)^T$ if $Y_i = 1$, and so forth. In this paper, we aim at estimating the distribution
of the observed sequence. This amounts to estimating the expectation of the sequence
$(X_1, \dots, X_n)$ since the probability vector of $X_i$ is also its expectation. To make the
connection with known methods from the literature, let us denote by $\mathcal{P}_r$ the set of all
probabilities on $\{1, \dots, r\}$, by $T_n$ the set $\{i/n, i = 1, \dots, n\}$ and consider the function
$\pi_n : T_n \to \mathcal{P}_r$ such that for every $i$,

$$\pi_n(i/n) = E(X_i). \tag{1}$$

Our aim is to estimate $\pi_n$ under as little prior information as possible on this function.
Note that $\pi_n$ depends on $n$ if $E(X_i)$ does not (which is typically the case for DNA
sequences).

The paper in the literature whose aim is closest to ours is the one by Aerts and
Veraverbeke [1]. However, the approach in that paper is asymptotic: there, $Y_x$ denotes a
categorical response variable at a fixed design point $x \in [0, 1]^d$, and the aim is to estimate
$\pi(x) = E(X_x)$, with $X_x$ the multinomial variable corresponding to $Y_x$, on the basis of
independent observations $Y_{x_1}, \dots, Y_{x_n}$. The main difference between this model and ours
is that $\pi$ does not depend on $n$, whence the approach is asymptotic. The similarity is that
no specific model is postulated for this function. Assuming that $\pi$ is Lipschitz of order
one, Aerts and Veraverbeke show that, at a fixed point $x \in (0, 1)^d$, a kernel estimator of
$\pi(x)$ with a proper bandwidth is asymptotically Gaussian as $n \to \infty$ and they propose a
consistent bootstrap procedure. As usual, the optimal choice for their bandwidth depends
on the smoothness of $\pi$, the order $n^{-1/(4+d)}$ they propose for $d < 4$ being suitable only
if $\pi$ is Lipschitz of order one.

To our knowledge, [1] is the only paper where no specific model is postulated. Most
of the known nonparametric estimators in the literature for the joint distribution of
multinomial variables are concerned with the segmentation model, which assumes that
$\pi_n$ is a step function. Thus, there is a partition of $\{1, \dots, n\}$ into segments within which
the observations follow the same distribution and between which observations have different
distributions, and an important task is to detect the change points. Within the
segmentation model, several methods consist in minimizing, for every fixed $k$, a given
contrast function over all possible partitions into $k$ segments, then selecting a convenient
$k$ by penalizing the contrast function. Braun, Braun and Müller [8] consider a regression
model that covers the case of multinomial variables. They assume $\pi_n$ independent
of $n$ and penalize the quasi-deviance with a modified Schwarz criterion. Lebarbier and
Nédélec [15] consider both least-squares and likelihood contrasts and propose a penalization
inspired by Birgé and Massart [5], allowing $\pi_n$ to depend on $n$. In a slightly different
spirit, Fu and Curnow [10] maximize the likelihood under restrictions on the lengths of
the segments. The problem with these methods, in addition to being based on a quite
restrictive model, is that their computational time does not scale
to large sequences: using dynamic programming algorithms, the computational effort is
$O(n^2)$ when the number of change points is known, thus $O(n^3)$ when this number is
allowed to grow with $n$.

In this paper, we do not postulate a specific model for $\pi_n$. We propose an estimator
which is adaptive, non-asymptotic, nonparametric and implementable even for large $n$.
For every $i$ and $j$, we denote by $s_i^{(j)}$ the $j$th component of the vector $E(X_i)$ and we build
a penalized contrast estimator as follows: Assume $n = 2^N$ for some integer $N > 1$ and let
$\{\phi_\lambda, \lambda = 1, \dots, n\}$ be the orthonormal Haar basis of $\mathbb{R}^n$. We consider a collection $\mathcal{M}$ of
subsets of $\{1, \dots, n\}$ and for every $m \in \mathcal{M}$, we compute the least-squares estimators of
the vectors

$$s^{(j)} = (s_1^{(j)}, \dots, s_n^{(j)}), \quad j = 1, \dots, r,$$

under the assumption that they all belong to the linear span of $\{\phi_\lambda, \lambda \in m\}$. Then we
select a convenient subset $m$ by penalizing the least-squares contrast after the fashion of
Birgé and Massart [5]. At first, we consider an exhaustive strategy which roughly consists
in defining $\mathcal{M}$ as the collection of all possible subsets of $\{1, \dots, n\}$. We prove that, up to
a $\log n$ factor, the resulting estimator satisfies an oracle inequality and is adaptive in the
minimax sense over a class of Besov bodies. Subsequently, we consider a non-exhaustive
strategy: Based on the compression algorithm developed in [4], we build a collection
$\mathcal{M}$ in such a way that the resulting estimator satisfies an oracle inequality and
is adaptive over this class of Besov bodies without the $\log n$ factor. Both strategies are
implementable even for large $n$ since all the least-squares estimators can be computed at
the same time: The computational effort is $O(n \log n)$ with both strategies.

In order to gain generality, we embed our method into a general framework: We consider
a collection $\{S_m, m \in \mathcal{M}\}$ of linear subspaces of $\mathbb{R}^n$, we estimate the vectors $s^{(j)}$,
$j = 1, \dots, r$ under the assumption that they all belong to $S_m$ and then select a convenient
model $S_m$. Thus for the above strategies, $S_m$ is merely the span of $\{\phi_\lambda, \lambda \in m\}$ for a
given $m \subset \{1, \dots, n\}$. The general framework also recovers the method of Lebarbier and
Nédélec [15] and allows us to slightly improve some of their results. Moreover, we prove
that their method is adaptive in the segmentation model.

The paper is organized as follows: The general framework is described in Section 2. 
Subsequently, special cases are more precisely described: the method of Lebarbier and 
Nedelec is studied in Section 3 and the exhaustive and non-exhaustive strategies based 
on the Haar basis are studied in Sections 4 and 5, respectively. A simulation study is
reported in Section 6 and an application on a DNA sequence is given in Section 7. The 
proof of the main results is postponed to Section 8. 

2. The general framework 

Assume one observes independent variables $X_1, \dots, X_n$ where $X_i \in \{0, 1\}^r$ with some
integer $r \ge 2$ and the components of $X_i$ sum up to one. Hence if one observes independent
categorical variables $Y_1, \dots, Y_n$ which take values in $\{1, \dots, r\}$, the $j$th component of $X_i$
equals one if $Y_i = j$ and zero otherwise. We aim at estimating the expectation of the
sequence $(X_1, \dots, X_n)$. Notation given in Section 2.1 will be used throughout the paper.
Our penalized least-squares estimator is defined in Section 2.2 with an arbitrary penalty
function. The choice of the form of the penalty function is discussed in Section 2.3.
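
As a concrete illustration of the coding described above, the following minimal Python sketch builds the $r \times n$ 0/1 matrix whose $i$th column is $X_i$ from a categorical sequence $Y_1, \dots, Y_n$. The function name `one_hot_columns` and the use of NumPy are choices made for this illustration only; they are not part of the method.

```python
import numpy as np

def one_hot_columns(y, r):
    """Return the r x n 0/1 matrix whose i-th column encodes y[i] in {1, ..., r}."""
    y = np.asarray(y)
    if y.min() < 1 or y.max() > r:
        raise ValueError("labels must lie in {1, ..., r}")
    X = np.zeros((r, y.size))
    # Row index j - 1 receives a one in column i whenever y[i] == j.
    X[y - 1, np.arange(y.size)] = 1.0
    return X

# Hypothetical example: n = 4 observations on an alphabet of size r = 3.
y = np.array([1, 3, 2, 3])
X = one_hot_columns(y, r=3)
```

Each column of `X` sums to one, so the expectation of a column is the probability vector of the corresponding observation.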

2.1. Notation 

For our statistical model, we use the following notation: Let $X$ be the $r \times n$ random matrix
with $i$th column $X_i$, let $s$ be the expectation of $X$ and set $\varepsilon = X - s$. The underlying
probability and expectation in our model are denoted by $P_s$ and $E_s$, respectively.
For vectors and matrices, we use the following notation: By convention, vectors in $\mathbb{R}^r$
are column vectors while vectors in $\mathbb{R}^n$ are line vectors. The $j$th line of a matrix $t$ is
denoted by $t^{(j)}$ while its $i$th column is denoted by $t_i$. The Euclidean inner product and
Euclidean norm in $\mathbb{R}^p$, $p \in \mathbb{N}$, are denoted by $\langle \cdot, \cdot \rangle_p$ and $\| \cdot \|_p$, respectively. The space of
$r \times n$ matrices is equipped with the inner product

$$\langle t, t' \rangle = \sum_{j=1}^{r} \langle t^{(j)}, t'^{(j)} \rangle_n$$

and we denote by $\| \cdot \|$ the associated norm. For every linear subspace $S$ of $\mathbb{R}^n$, $\mathbb{R}^r \otimes S$
denotes the set of those $r \times n$ matrices the lines of which belong to $S$. Finally, we use the
same notation for a vector in $\mathbb{R}^n$ and for the corresponding discrete function. Thus, $v$
may denote the vector $(v_1, \dots, v_n)$ as well as the discrete function $v : i \mapsto v_i$, $i = 1, \dots, n$.

2.2. The estimator 

Recall that we aim at estimating $s$ on the basis of $X_1, \dots, X_n$. For this task, consider a
finite collection $\{S_m, m \in \mathcal{M}\}$ of linear subspaces of $\mathbb{R}^n$ where $\mathcal{M}$ is allowed to depend on
$n$. For every $m \in \mathcal{M}$, let $\hat{s}_m$ be the least-squares estimator of $s$ under the assumption that
the lines of $s$ all belong to $S_m$: $\hat{s}_m$ is the unique minimizer of $\|t - X\|^2$ over $t \in \mathbb{R}^r \otimes S_m$,
that is, the orthogonal projection of $X$ onto $\mathbb{R}^r \otimes S_m$. For ease of computation, it is worth
noticing that

$$\|t - X\|^2 = \sum_{j=1}^{r} \|t^{(j)} - X^{(j)}\|_n^2,$$

hence for every $m$ and $j$, $\hat{s}_m^{(j)}$ is the orthogonal projection of $X^{(j)}$ onto $S_m$. It is also
worth noticing that $\hat{s}_m$ minimizes the contrast

$$\gamma_n(t) = \|t\|^2 - 2\langle X, t \rangle$$

over $t \in \mathbb{R}^r \otimes S_m$. The estimator we are interested in is the penalized contrast estimator
defined by

$$\tilde{s} = \hat{s}_{\hat{m}},$$

where $\hat{m}$ minimizes $\gamma_n(\hat{s}_m) + \mathrm{pen}(m)$ over $m \in \mathcal{M}$. The function $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ is called
the penalty function and remains to be chosen, see Section 2.3. Typically, $\mathrm{pen}(m)$
increases with the dimension of $S_m$. The performance of our estimator then depends on the
strategy, that is, the way we choose the collection of models and the penalty function.
Strategies where $S_m$ is generated by either step functions or Haar functions are studied
in detail in subsequent sections.

Let us notice that since we do not take into account the constraint that every column
of $s$ is a probability on $\{1, \dots, r\}$, our estimator may not satisfy this constraint: The
components of a given column of the estimator sum up to one provided the vector $(1, \dots, 1)$
belongs to each $S_m$, $m \in \mathcal{M}$, but it may happen that components of the estimator do
not belong to $[0, 1]$. However, one may always project $\tilde{s}$ on the closed convex subset of
those matrices that satisfy the constraint, getting an estimator that is closer to $s$ than
$\tilde{s}$. Note also that the strategies studied in Sections 3-5 are such that $(1, \dots, 1)$ indeed
belongs to each space $S_m$, $m \in \mathcal{M}$. Thus, we assume in Section 2.3 that the dimension
of $S_m$ is $D_m \ge 1$ for each $m$.



2.3. The form of the penalty function 

The following theorem provides an upper bound for the L2-risk of our estimator, which 
gives indications on how to choose the penalty function. 

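Theorem 1. Let $\{S_m, m \in \mathcal{M}\}$ be a finite collection of linear subspaces of $\mathbb{R}^n$, for every
$m \in \mathcal{M}$, let $D_m$ be the dimension of $S_m$ and assume $D_m \ge 1$. Let $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ satisfy

$$\mathrm{pen}(m) \ge K D_m \bigl(1 + 2\sqrt{2 L_m}\bigr)^2 \quad \forall m \in \mathcal{M} \tag{2}$$

for some $K > 1$ and some positive numbers $L_m$. Then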



E.||«- 5|| 2 < ^K-iy Jrf {II*- *n|| 2 + Pon(m)} + 16k(^±^\, 
where $\bar{s}_m$ is the orthogonal projection of $s$ onto $\mathbb{R}^r \otimes S_m$ and

$$\Sigma = \sum_{m \in \mathcal{M}} \exp(-L_m D_m). \tag{3}$$

This suggests a choice for the form of the penalty function: It can be chosen in such
a way that the upper bound of the risk is as small as possible, subject to the constraint
(2). A possible choice is

$$\mathrm{pen}(m) = D_m (k_1 L_m + k_2) \tag{4}$$

for some positive $k_1$, $k_2$ and $L_m$, where the weights $L_m$ have to be chosen as small as
possible subject to the constraint that $\Sigma$ be small, $\Sigma \le 1$, say. In order to check the
relevance of this choice, we aim at proving an oracle-type inequality: It is expected that
the risk of $\tilde{s}$ is close to the minimal risk of the least-squares estimators $\hat{s}_m$, in the sense
that

$$E_s \|s - \tilde{s}\|^2 \le C \inf_{m \in \mathcal{M}} E_s \|s - \hat{s}_m\|^2, \tag{5}$$

where, ideally, $C > 0$ does not depend on $r$ and $n$.

Corollary 1. Let $\{S_m, m \in \mathcal{M}\}$ be a finite collection of linear subspaces of $\mathbb{R}^n$, for every
$m \in \mathcal{M}$, let $D_m$ be the dimension of $S_m$ and assume $D_m \ge 1$. Let $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$ be defined
by (4) with some positive $k_1$, $k_2$ and $L_m$'s that satisfy $\Sigma \le 1$, see (3). Assume $s_i^{(j)} \le 1 - \rho$
for some $\rho > 0$ and all $i, j$. If $k_1$ and $k_2$ are large enough, then there exists $C > 0$ that
only depends on $k_1$, $k_2$ and $\rho$ such that

$$E_s \|s - \tilde{s}\|^2 \le C \Bigl(1 + \sup_{m \in \mathcal{M}} L_m\Bigr) \inf_{m \in \mathcal{M}} E_s \|s - \hat{s}_m\|^2.$$

Thus (5) holds with $C$ depending on neither $n$ nor $r$, provided $k_1$, $k_2$ are large enough
and $L_m \le L$ for every $m$ and an absolute constant $L$. However, $\mathcal{M}$ is allowed to depend
on $n$ and it is assumed that $\Sigma \le 1$, so this is possible only if the collection of models
is not too rich: If $\mathcal{M}$ contains many models with dimension $D_m$, then $L_m$ has to increase
with $n$ to ensure $\Sigma \le 1$ and (5) only holds with $C > 0$ depending on $n$ in that case.

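Proof of Corollary 1. For every $m$, let $\{\psi_{m1}, \dots, \psi_{mD_m}\}$ be an orthonormal basis of
$S_m$. For every $j$, the $j$th line of $\hat{s}_m$ is the orthogonal projection of $X^{(j)}$ onto $S_m$, hence

$$\hat{s}_m^{(j)} = \sum_{k=1}^{D_m} \langle X^{(j)}, \psi_{mk} \rangle_n \psi_{mk}. \tag{6}$$

Since $\bar{s}_m$ is defined in the same way as $\hat{s}_m$ with $X$ replaced by $s$, we get

$$\|\hat{s}_m - \bar{s}_m\|^2 = \sum_{j=1}^{r} \sum_{k=1}^{D_m} \langle \varepsilon^{(j)}, \psi_{mk} \rangle_n^2.$$

We have $\mathrm{var}_s(\varepsilon_i^{(j)}) = s_i^{(j)} (1 - s_i^{(j)})$, where $1 - s_i^{(j)} \ge \rho$ and the components of $s_i$ sum up
to one, so

$$\sum_{j=1}^{r} \mathrm{var}_s(\varepsilon_i^{(j)}) \ge \rho.$$

Moreover, the components of $\varepsilon^{(j)}$ are independent and centered, hence

$$E_s \|\hat{s}_m - \bar{s}_m\|^2 \ge \rho D_m.$$

It thus follows from Pythagoras' equality that

$$E_s \|s - \hat{s}_m\|^2 \ge \|s - \bar{s}_m\|^2 + \rho D_m.$$

By Theorem 1, there exists $C' > 0$ that only depends on $k_1$ and $k_2$ such that

$$E_s \|s - \tilde{s}\|^2 \le C' \Bigl(1 + \sup_{m \in \mathcal{M}} L_m\Bigr) \inf_{m \in \mathcal{M}} \bigl\{\|s - \bar{s}_m\|^2 + D_m\bigr\}.$$

The result easily follows from the last two displays. □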
3. Exhaustive strategy based on indicator functions

In the segmentation model, it is assumed that there exists a partition of $\{1, \dots, n\}$ into
intervals within which the observations follow the same distribution and between which
observations have different distributions. This amounts to assuming that the functions
$s^{(j)}$, $j = 1, \dots, r$ are piecewise-constant with respect to a common partition of $\{1, \dots, n\}$,
so one can use in this model the general method of Section 2, as follows.

For every subset $I \subset \{1, \dots, n\}$, let $\psi_I$ denote the indicator function of $I$, which means
that $\psi_I$ is a vector in $\mathbb{R}^n$ with an $i$th component equal to one if $i \in I$ and zero otherwise.
Let $\mathcal{M}$ be the collection of all partitions of $\{1, \dots, n\}$ into intervals and for every $m \in \mathcal{M}$
let $S_m$ be the linear span of $\{\psi_I, I \in m\}$. The dimension of $S_m$ is the cardinality of $m$,
hence for every $D = 1, \dots, n$ the number $N_D$ of models with dimension $D$ satisfies

$$N_D = \binom{n-1}{D-1} \le \Bigl(\frac{en}{D}\Bigr)^D. \tag{7}$$

Setting $L_m = 2 + \log(n/D_m)$ thus ensures $\Sigma \le 1$ (see (3)):

$$\Sigma = \sum_{m \in \mathcal{M}} \exp(-L_m D_m) = \sum_{D=1}^{n} N_D \exp\Bigl(-D \Bigl(2 + \log \frac{n}{D}\Bigr)\Bigr) \le 1. \tag{8}$$

By Corollary 1, the estimator based on a penalty function of the form (4) thus satisfies
an oracle inequality up to a $\log n$ factor, provided $k_1$ and $k_2$ are large enough. With the
above definition of $L_m$, the penalty function takes the form

$$\mathrm{pen}(m) = D_m \Bigl(c_1 \log \frac{n}{D_m} + c_2\Bigr). \tag{9}$$

In the sequel, we denote by EI (an abbreviation of Exhaustive/Indicator) the strategy
based on the above collection $\mathcal{M}$ and a penalty function of the form (9). The following
proposition is then easily derived from Corollary 1.

Proposition 1. Let $\tilde{s}$ be computed with the EI strategy. Assume $s_i^{(j)} \le 1 - \rho$, $\rho > 0$, for
all $i, j$. If $c_1$, $c_2$ are large enough, then there exists $C > 0$ that only depends on $\rho$, $c_1$, $c_2$
such that

$$E_s \|s - \tilde{s}\|^2 \le C \log n \inf_{m \in \mathcal{M}} E_s \|s - \hat{s}_m\|^2.$$

The EI strategy was first studied by Lebarbier and Nédélec [15]. Proposition 1 is a slight
improvement of one of their results since they obtain this inequality with $C$ replaced by
$Cr$ and under a more restrictive assumption. With this strategy, for every $I \in \hat{m}$ and
every $i \in I$, the $i$th column of $\tilde{s}$ is the mean of the vectors $(X_l)_{l \in I}$ so, in particular, every
column of $\tilde{s}$ is a probability on $\{1, \dots, r\}$.

We pursue the study of the EI strategy by proving that it is adaptive in the segmentation
model. For a given positive integer $k$, let $\mathcal{P}_k$ be the set of those $r \times n$ matrices
$t$ such that every column of $t$ is a probability on $\{1, \dots, r\}$ and there exists a partition
of $\{1, \dots, n\}$ into $k$ intervals on which the functions $t^{(j)}$, $j = 1, \dots, r$, are constant. In
the sequel, we prove that the EI strategy is adaptive with respect to $k$ in the sense that
the corresponding estimator is minimax simultaneously over the sets $\mathcal{P}_k$, $k = 3, \dots, n$.
A lower bound for the minimax risk over $\mathcal{P}_k$ is given in the following theorem and the
minimax result is derived in Corollary 2.

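Theorem 2. Assume $n \ge 3$ and fix $k \in \{3, \dots, n\}$. There exists an absolute constant
$C > 0$ such that

$$\inf_{\hat{s}} \sup_{s \in \mathcal{P}_k} E_s \|s - \hat{s}\|^2 \ge C k \Bigl(\log\Bigl(\frac{n}{k}\Bigr) + 1\Bigr),$$

where the infimum is taken over all estimators $\hat{s}$.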

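Corollary 2. Let $\tilde{s}$ be computed with the EI strategy. If $c_1$, $c_2$ are large enough, $n \ge 3$
and $k \in \{3, \dots, n\}$, then there exists $C > 0$ that only depends on $c_1$ and $c_2$ such that

$$\sup_{s \in \mathcal{P}_k} E_s \|s - \tilde{s}\|^2 \le C \inf_{\hat{s}} \sup_{s \in \mathcal{P}_k} E_s \|s - \hat{s}\|^2.$$

Proof. We may assume $\Sigma \le 1$ so Theorem 1 shows that there exists $C > 0$ that only
depends on $c_1$ and $c_2$ such that for every $s$

$$E_s \|s - \tilde{s}\|^2 \le C \min_{1 \le D \le n} \Bigl\{ \min_{m, D_m = D} \|s - \bar{s}_m\|^2 + D \Bigl(\log\Bigl(\frac{n}{D}\Bigr) + 1\Bigr) \Bigr\}
\le C \Bigl\{ \min_{m, D_m = k} \|s - \bar{s}_m\|^2 + k \Bigl(\log\Bigl(\frac{n}{k}\Bigr) + 1\Bigr) \Bigr\}.$$

The latter minimum vanishes if $s \in \mathcal{P}_k$, so the result follows from Theorem 2. □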



4. Exhaustive strategy based on Haar functions 

In this section, we assume $n = 2^N$ for an integer $N > 1$ and we consider models generated
by Haar functions, as defined below.

Definition 1. Let $\Lambda = \bigcup_{j=-1}^{N-1} \Lambda(j)$, where $\Lambda(-1) = \{(-1, 0)\}$ and

$$\Lambda(j) = \{(j, k), k = 0, \dots, 2^j - 1\}$$

for $j \ne -1$. Let $\varphi : \mathbb{R} \to \{-1, 1\}$ be the function with support $(0, 1]$ that takes value $1$ on
$(0, 1/2]$ and $-1$ on $(1/2, 1]$. For every $\lambda \in \Lambda$, let $\phi_\lambda$ be the vector in $\mathbb{R}^n$ with $i$th component
$\phi_{\lambda i} = 1/\sqrt{n}$ if $\lambda = (-1, 0)$ and

$$\phi_{\lambda i} = \frac{2^{j/2}}{\sqrt{n}} \varphi\Bigl(\frac{2^j i}{n} - k\Bigr)$$

if $\lambda = (j, k)$ for some $j \ne -1$. The orthonormal basis $\{\phi_\lambda, \lambda \in \Lambda\}$ of $\mathbb{R}^n$ is called the Haar
basis and its elements are called Haar functions.
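
To make Definition 1 concrete, here is a small Python sketch that builds the $n$ Haar vectors explicitly and lets one check orthonormality numerically. It is meant for small $n$ only (it stores an $n \times n$ matrix); the function name `haar_basis` and the row ordering (first $(-1,0)$, then the levels $j = 0, \dots, N-1$ with $k$ increasing) are conventions assumed for this illustration.

```python
import numpy as np

def haar_basis(n):
    """Return an n x n array whose rows are the discrete Haar vectors.

    Row 0 is the constant vector for (-1, 0); the rows for level j follow,
    ordered by k = 0, ..., 2**j - 1. n must be a power of two.
    """
    N = n.bit_length() - 1
    if n < 1 or (1 << N) != n:
        raise ValueError("n must be a power of two")
    i = np.arange(1, n + 1)
    rows = [np.full(n, 1.0 / np.sqrt(n))]
    for j in range(N):
        for k in range(2 ** j):
            u = (2 ** j) * i / n - k  # argument of the mother function
            # +1 on (0, 1/2], -1 on (1/2, 1], 0 outside the support (0, 1]
            mother = ((u > 0) & (u <= 0.5)).astype(float) \
                   - ((u > 0.5) & (u <= 1)).astype(float)
            rows.append(2 ** (j / 2) / np.sqrt(n) * mother)
    return np.array(rows)

Phi = haar_basis(8)
gram = Phi @ Phi.T  # should be the identity if the basis is orthonormal
```

For large $n$ one would not form this matrix; a fast pyramid transform gives the same coefficients in linear time per line of $X$.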



Let $\mathcal{M}$ be the collection of all subsets of $\Lambda$ that contain $(-1, 0)$ and for every $m \in \mathcal{M}$,
let $S_m$ be the linear span of $\{\phi_\lambda, \lambda \in m\}$. The number $N_D$ of models with dimension $D$
satisfies (7). In a manner similar to that of Section 3, we denote by EH (an abbreviation of
Exhaustive/Haar) the strategy based on the above collection $\mathcal{M}$ and a penalty function
of the form (9). The following proposition is then easily derived from Corollary 1.

Proposition 2. Let $\tilde{s}$ be computed with the EH strategy. Assume $s_i^{(j)} \le 1 - \rho$, $\rho > 0$, for
all $i, j$. If $c_1$, $c_2$ are large enough, then there exists $C > 0$ that only depends on $\rho$, $c_1$, $c_2$
such that

$$E_s \|s - \tilde{s}\|^2 \le C \log n \inf_{m \in \mathcal{M}} E_s \|s - \hat{s}_m\|^2.$$

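A great advantage of the EH strategy is that the obtained estimator is easily computable
in practice, as we explain now. For every $m$, let $\hat{s}_m$ be the orthogonal projection
of $X$ onto $\mathbb{R}^r \otimes S_m$. Then $\gamma_n(\hat{s}_m) = -\|\hat{s}_m\|^2$ and it follows from (6) that

$$\|\hat{s}_m\|^2 = \sum_{\lambda \in m} \|\hat{\beta}_\lambda\|_r^2, \quad \text{where} \quad \|\hat{\beta}_\lambda\|_r^2 = \sum_{j=1}^{r} \bigl(\hat{\beta}_\lambda^{(j)}\bigr)^2 \quad \text{and} \quad \hat{\beta}_\lambda^{(j)} = \langle X^{(j)}, \phi_\lambda \rangle_n.$$

For every $m$, $\mathrm{pen}(m)$ only depends on $D_m$ so we can write $\mathrm{pen}(m) = \mathrm{pen}'(D_m)$. Hence

$$\min_{m \in \mathcal{M}} \{\gamma_n(\hat{s}_m) + \mathrm{pen}(m)\}
= \min_{1 \le D \le n} \Bigl\{ -\max_{m, D_m = D} \sum_{\lambda \in m} \|\hat{\beta}_\lambda\|_r^2 + \mathrm{pen}'(D) \Bigr\}
= \min_{1 \le D \le n} \Bigl\{ -\sum_{k=1}^{D} \|\hat{\beta}_{\tau(k)}\|_r^2 + \mathrm{pen}'(D) \Bigr\},$$

where $\|\hat{\beta}_{\tau(1)}\|_r \ge \dots \ge \|\hat{\beta}_{\tau(n)}\|_r$ are the coefficients arranged in decreasing order. The
computation of $\tilde{s}$ thus requires a computational effort of order $O(n \log n)$ using the
following algorithm.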
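Algorithm EH:

- Compute $\hat{\beta}_\lambda^{(j)}$ for every $j$ and $\lambda \in \Lambda$.

- Compute the coefficients $\|\hat{\beta}_\lambda\|_r$ for every $\lambda \in \Lambda$.

- Arrange these coefficients in decreasing order.

- Compute $-\sum_{k=1}^{D} \|\hat{\beta}_{\tau(k)}\|_r^2 + \mathrm{pen}'(D)$ for every $D = 1, \dots, n$.

- Find $\hat{D}$ that minimizes the quantity above.

- Compute $\tilde{s}^{(j)} = \sum_{k=1}^{\hat{D}} \hat{\beta}_{\tau(k)}^{(j)} \phi_{\tau(k)}$ for every $j$.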
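
The steps of Algorithm EH can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`haar_forward`, `haar_inverse`, `eh_estimator`) are hypothetical, the coefficients are stored with $(-1,0)$ in column 0 and level $j$ in columns $2^j, \dots, 2^{j+1}-1$, the penalty is taken of the form (9) with user-supplied constants `c1`, `c2`, and the index $(-1,0)$ is forced into every model before the remaining coefficients are sorted.

```python
import numpy as np

def haar_forward(X):
    """Haar coefficients of each line of X (r x n, n a power of two)."""
    a = np.asarray(X, dtype=float)
    details = []
    while a.shape[1] > 1:
        even, odd = a[:, 0::2], a[:, 1::2]
        details.append((even - odd) / np.sqrt(2.0))  # finest level first
        a = (even + odd) / np.sqrt(2.0)
    # Column 0: (-1, 0); then levels j = 0, 1, ..., N - 1.
    return np.hstack([a] + details[::-1])

def haar_inverse(B):
    """Inverse of haar_forward."""
    B = np.asarray(B, dtype=float)
    a, pos = B[:, :1], 1
    while pos < B.shape[1]:
        d = B[:, pos:2 * pos]
        nxt = np.empty((B.shape[0], 2 * pos))
        nxt[:, 0::2] = (a + d) / np.sqrt(2.0)
        nxt[:, 1::2] = (a - d) / np.sqrt(2.0)
        a, pos = nxt, 2 * pos
    return a

def eh_estimator(X, c1, c2):
    """Sketch of Algorithm EH with pen'(D) = D * (c1 * log(n / D) + c2)."""
    r, n = X.shape
    if n & (n - 1):
        raise ValueError("n must be a power of two")
    B = haar_forward(X)
    energy = (B ** 2).sum(axis=0)  # squared norms of the coefficient vectors
    # Keep (-1, 0) first, then sort the other indices by decreasing energy.
    order = np.concatenate(([0], 1 + np.argsort(-energy[1:], kind="stable")))
    D = np.arange(1, n + 1)
    crit = -np.cumsum(energy[order]) + D * (c1 * np.log(n / D) + c2)
    D_hat = int(D[np.argmin(crit)])
    keep = np.zeros(n, dtype=bool)
    keep[order[:D_hat]] = True
    return haar_inverse(B * keep), D_hat

# Toy example: r = 2, n = 8, one change point in the middle.
X = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
S, D_hat = eh_estimator(X, c1=0.5, c2=1.0)
```

The transform costs $O(rn)$ operations and the sort $O(n \log n)$, in line with the order of magnitude stated above. The values `c1=0.5` and `c2=1.0` are arbitrary and chosen only for the toy example.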

We establish below that the EH strategy is adaptive up to a $\log n$ factor with respect
to classes $\mathcal{P}(R, \alpha, p)$ of matrices related to Besov bodies, see Definition 2. To motivate
the choice of these classes, we make the connection with the more usual problem of
estimating a function $\pi \in L_2[0, 1]$ on the basis of the observations

$$y_i = \pi(i/n) + \delta_i, \quad i = 1, \dots, n, \tag{10}$$

where the $\delta_i$'s are i.i.d. centered random variables. With the notation of Definition 1,
define $\Phi_\lambda(x) = 2^{j/2} \varphi(2^j x - k)$ for all $j \ge 0$ and $\lambda = (j, k) \in \Lambda(j)$, and let $\Phi_{(-1,0)}(x) = 1$,
$x \in [0, 1]$. Then, $\{\Phi_\lambda, \lambda \in \bigcup_{j \ge -1} \Lambda(j)\}$ is the Haar basis of $L_2[0, 1]$ and the coefficients of
$\pi$ in this basis are

$$\tau_\lambda = \langle \pi, \Phi_\lambda \rangle_{L_2}$$

with $\langle \cdot, \cdot \rangle_{L_2}$ the scalar product in $L_2[0, 1]$. If $\pi$ belongs to a Besov ball in $B_p^{\alpha}(L_p)$ with
$\alpha < 1$, $\alpha > 1/p - 1/2$ and $p \le 2$, then typically,

$$\sum_{j \ge 0} 2^{jp(\alpha + 1/2 - 1/p)} \sum_{\lambda \in \Lambda(j)} |\tau_\lambda|^p \le R^p \tag{11}$$

for some $R > 0$, see Donoho and Johnstone [9] or Section 4.5 of Massart [16]. Constraints
like (11) are thus often considered to derive minimax rates of convergence over Besov
balls in model (10). Now, suppose there is $\pi \in L_2[0, 1]$ such that for every $i$ and a fixed
$j$,

$$s_i^{(j)} = \pi(i/n).$$

Then with the notation of Definition 1,

$$\beta_\lambda^{(j)} := \langle s^{(j)}, \phi_\lambda \rangle_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \pi(i/n) \Phi_\lambda(i/n).$$

Approximating an integral over $[0, 1]$ with a discrete sum thus yields

$$\beta_\lambda^{(j)} \approx \sqrt{n}\, \tau_\lambda. \tag{12}$$

Thus, by analogy with (11), we consider the following classes $\mathcal{P}(R, \alpha, p)$.
Definition 2. Let $R > 0$, $p \in (0, 2]$, $\alpha > 1/p - 1/2$. Let $\mathcal{B}(R, \alpha, p)$ be the set of matrices
$t \in \mathbb{R}^r \otimes \mathbb{R}^n$ such that the coefficients $\beta_\lambda^{(j)} = \langle t^{(j)}, \phi_\lambda \rangle_n$, $j = 1, \dots, r$ satisfy

$$\sum_{j=0}^{N-1} 2^{jp(\alpha + 1/2 - 1/p)} \sum_{\lambda \in \Lambda(j)} \|\beta_\lambda\|_r^p \le n^{p/2} R^p.$$

We define $\mathcal{P}(R, \alpha, p)$ to be the set of those $r \times n$ matrices $s \in \mathcal{B}(R, \alpha, p)$ such that every
column of $s$ is a probability on $\{1, \dots, r\}$.

We aim to prove that the EH strategy is adaptive (up to a $\log n$ factor) in the sense
that the corresponding estimator is minimax (up to a $\log n$ factor) simultaneously over
sets $\mathcal{P}(R, \alpha, p)$ for various $R$, $\alpha$ and $p$. In this regard, we first establish a lower bound for
the minimax risk over these sets.

Theorem 3. Let $R > 0$, $p \in (0, 2]$ and $\alpha > 1/p - 1/2$. Assume $n^\alpha > R$ and $nR^2 > 1$.
Then there exists an absolute constant $C > 0$ such that

$$\inf_{\hat{s}} \sup_{s \in \mathcal{P}(R, \alpha, p)} E_s \|s - \hat{s}\|^2 \ge C (nR^2)^{1/(2\alpha + 1)},$$

where the infimum is taken over all estimators $\hat{s}$.

In the problem of estimating $\pi \in L_2[0, 1]$ within model (10), the minimax rate of
convergence in the squared $L_2$-norm over sets like (11) is typically

$$R^{2/(2\alpha + 1)} n^{-2\alpha/(2\alpha + 1)}, \tag{13}$$

see Donoho and Johnstone [9] or Section 4.3 of Massart [16]. Similar to (12), $\|s^{(j)} - \hat{s}^{(j)}\|_n^2$
is approximately $n \|\pi - \hat{\pi}\|_{L_2}^2$ with $\hat{\pi}$ an estimator of $\pi$ and $\| \cdot \|_{L_2}$ the $L_2$-norm, so
the lower bound in Theorem 3 compares to (13), since $n \cdot R^{2/(2\alpha+1)} n^{-2\alpha/(2\alpha+1)} = (nR^2)^{1/(2\alpha+1)}$,
and appears natural.

An upper bound for the maximal risk of our estimator is given in Theorem 4 below
and the minimax result is derived in Corollary 3.

Theorem 4. Let $\tilde{s}$ be computed with the EH strategy. Let $R > 0$, $p \in (0, 2]$, $\alpha > 1/p - 1/2$
and let $\rho$ be the unique solution in $(1, \infty)$ of $\rho \log \rho = 2\alpha + 1$. Assume $c_1$, $c_2$ large enough,
$nR^2 > \log n > 1$ and $n^\alpha > R\sqrt{\rho}$. Then there exists $C > 0$ that only depends on $\alpha$, $p$, $c_1$
and $c_2$ such that



2a/(2a+l) 

sup E a ||s-s|| 2 <C(n.R 2 ) 1 /< aa+1 > 



1 +IOE 



Corollary 3. Under the assumptions of Theorem 4, there exists $C > 0$ that only depends
on $\alpha$, $p$, $c_1$ and $c_2$ such that



sup E s ||s-S|| 2 <C 



2a/(2a+l) 

inf sup E s 1 1 S — s 1 1 2 . 

S s£V(R,a,p) 
5. A non-exhaustive strategy based on Haar functions

The strategy described in the preceding section is attractive because it is easily
implementable. It satisfies an oracle inequality and is minimax over Besov bodies up to a log
factor. In this section, we propose another strategy based on Haar functions that is also
easily implementable and satisfies an oracle inequality and a minimax result over
Besov bodies without the log factor. With this strategy, $\mathcal{M}$ is composed of some well-chosen
subsets of $\Lambda$ that contain $(-1, 0)$, whereas the exhaustive strategy of the preceding
section considers all possible such subsets. As a consequence, the penalty function can be
chosen smaller than in the preceding section, which allows us to select models with higher
dimensions. The obtained estimator should thus have a smaller risk than in the preceding
section, provided the proposed collection of models properly approximates the exhaustive
collection. For the choice of a proper collection of models, we were inspired by Birgé and
Massart ([4] and Section 6.4 of [5]).

In this section, we assume $n = 2^N$ for an integer $N > 1$ and we use the same notation
as in Definition 1. For every $J \in \{0, \dots, N - 1\}$ let

$$\mathcal{M}_J = \Bigl\{ m \subset \Lambda,\ m = \Bigl[\bigcup_{j=-1}^{J-1} \Lambda(j)\Bigr] \cup \Bigl[\bigcup_{k=0}^{N-J-1} \Lambda'(J + k)\Bigr] \Bigr\},$$

where $\Lambda'(J + k)$ is any subset of $\Lambda(J + k)$ with cardinality $\lfloor 2^J (k + 1)^{-3} \rfloor$ (here, $\lfloor x \rfloor$ is
the integer part of $x$). Let

$$\mathcal{M} = \bigcup_{J=0}^{N-1} \mathcal{M}_J$$

and, for every $m \in \mathcal{M}$, let $S_m$ be the linear span of $\{\phi_\lambda, \lambda \in m\}$. We have $D_m \ge 2^J$ for
every $m \in \mathcal{M}_J$ and, from Proposition 4 of [4], there exists an absolute constant $A$ such
that the cardinality of $\mathcal{M}_J$ is less than $\exp(A 2^J)$. Setting $L_m = L$ for every $m \in \mathcal{M}$ and
some large enough absolute constant $L$ thus ensures $\Sigma \le 1$, see (3). With this definition
of $L_m$, a penalty function of the form (4) takes the form

$$\mathrm{pen}(m) = c D_m \tag{14}$$

for some positive number $c$. In the sequel, we denote by NEH (an abbreviation of
Non-Exhaustive/Haar) the strategy based on the above collection $\mathcal{M}$ and a penalty function
of the form (14). It follows from Corollary 1 that the corresponding estimator satisfies
an oracle inequality.



Proposition 3. Let $\tilde{s}$ be computed with the NEH strategy. Assume $s_i^{(j)} \le 1 - \rho$, $\rho > 0$,
for all $i, j$. If $c$ is large enough, then there exists $C > 0$ that only depends on $\rho$ and $c$ such
that

$$E_s \|s - \tilde{s}\|^2 \le C \inf_{m \in \mathcal{M}} E_s \|s - \hat{s}_m\|^2.$$
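The NEH strategy is easily implementable, as we explain now. All elements of a given
$\mathcal{M}_J$ have the same cardinality so for every $m \in \mathcal{M}_J$, $\mathrm{pen}(m)$ only depends on $J$. Similar
to Section 4, we can write $\mathrm{pen}(m) = \mathrm{pen}'(J)$, hence

$$\min_{m \in \mathcal{M}} \{\gamma_n(\hat{s}_m) + \mathrm{pen}(m)\}
= \min_{0 \le J \le N-1} \Bigl\{ -\max_{m \in \mathcal{M}_J} \sum_{\lambda \in m} \|\hat{\beta}_\lambda\|_r^2 + \mathrm{pen}'(J) \Bigr\}
= \min_{0 \le J \le N-1} \Bigl\{ -\sum_{\lambda \in \hat{m}_J} \|\hat{\beta}_\lambda\|_r^2 + \mathrm{pen}'(J) \Bigr\}.$$

Here,

$$\hat{m}_J = \Bigl[\bigcup_{j=-1}^{J-1} \Lambda(j)\Bigr] \cup \Bigl[\bigcup_{k=0}^{N-J-1} \hat{\Lambda}(J + k)\Bigr],$$

where $\hat{\Lambda}(J + k)$ is the set that contains the indices of the $\lfloor 2^J (k + 1)^{-3} \rfloor$ largest coefficients
$\|\hat{\beta}_\lambda\|_r$, $\lambda \in \Lambda(J + k)$. The computation of $\tilde{s}$ thus requires a computational effort of order
$O(n \log n)$ using the following algorithm.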
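Algorithm NEH:

- Compute $\hat{\beta}_\lambda^{(j)}$ for every $j$ and $\lambda \in \Lambda$.

- Compute the coefficients $\|\hat{\beta}_\lambda\|_r$ for every $\lambda \in \Lambda$.

- For every $j$, arrange $\|\hat{\beta}_\lambda\|_r$, $\lambda \in \Lambda(j)$ in decreasing order.

- Compute $\hat{m}_J$ for every $J$.

- Compute $-\sum_{\lambda \in \hat{m}_J} \|\hat{\beta}_\lambda\|_r^2 + \mathrm{pen}'(J)$ for every $J$.

- Find $\hat{J}$ that minimizes the quantity above.

- Compute $\tilde{s}^{(j)} = \sum_{\lambda \in \hat{m}_{\hat{J}}} \hat{\beta}_\lambda^{(j)} \phi_\lambda$ for every $j$.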
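
Algorithm NEH can be sketched along the same lines. Again this is only an illustration under assumptions of ours: the helper names are hypothetical, the coefficients are stored with $(-1,0)$ in column 0 and level $j$ in columns $2^j, \dots, 2^{j+1}-1$, the penalty is $\mathrm{pen}(m) = c D_m$ as in (14) with a user-supplied `c`, and ties between criteria are broken in favour of the smallest $J$.

```python
import numpy as np

def haar_forward(X):
    """Haar coefficients of each line of X (r x n, n a power of two)."""
    a = np.asarray(X, dtype=float)
    details = []
    while a.shape[1] > 1:
        even, odd = a[:, 0::2], a[:, 1::2]
        details.append((even - odd) / np.sqrt(2.0))
        a = (even + odd) / np.sqrt(2.0)
    return np.hstack([a] + details[::-1])

def haar_inverse(B):
    """Inverse of haar_forward."""
    B = np.asarray(B, dtype=float)
    a, pos = B[:, :1], 1
    while pos < B.shape[1]:
        d = B[:, pos:2 * pos]
        nxt = np.empty((B.shape[0], 2 * pos))
        nxt[:, 0::2] = (a + d) / np.sqrt(2.0)
        nxt[:, 1::2] = (a - d) / np.sqrt(2.0)
        a, pos = nxt, 2 * pos
    return a

def neh_estimator(X, c):
    """Sketch of Algorithm NEH with pen(m) = c * D_m."""
    r, n = X.shape
    N = n.bit_length() - 1
    if n < 2 or (1 << N) != n:
        raise ValueError("n must be a power of two, n >= 2")
    B = haar_forward(X)
    energy = (B ** 2).sum(axis=0)
    # Sort each level once: ranked[j] lists the positions of level j
    # by decreasing energy.
    ranked = [2 ** j + np.argsort(-energy[2 ** j:2 ** (j + 1)], kind="stable")
              for j in range(N)]
    best = None
    for J in range(N):
        keep = np.zeros(n, dtype=bool)
        keep[:2 ** J] = True                  # (-1, 0) and levels 0, ..., J - 1
        for k in range(N - J):
            size = 2 ** J // (k + 1) ** 3     # floor(2^J (k + 1)^(-3))
            keep[ranked[J + k][:size]] = True
        crit = -energy[keep].sum() + c * keep.sum()
        if best is None or crit < best[0]:
            best = (crit, J, keep)
    _, J_hat, keep = best
    return haar_inverse(B * keep), J_hat

# Toy example: r = 2, n = 8, one change point in the middle.
X = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
S, J_hat = neh_estimator(X, c=1.0)
```

Sorting each level once costs $O(n \log n)$ in total, and each of the $N = \log_2 n$ candidate values of $J$ is then evaluated in $O(n)$ operations.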

To conclude the study of the NEH strategy, we prove that it is adaptive in the sense that
the corresponding estimator is minimax simultaneously over sets $\mathcal{P}(R, \alpha, p)$ for various
$R$, $\alpha$ and $p$, see Definition 2. The following theorem provides an upper bound for the
maximal risk of our estimator. Comparing this upper bound with the lower bound in
Theorem 3 proves the minimax result stated in Corollary 4.

Theorem 5. Let $\tilde{s}$ be computed with the NEH strategy. Let $R > 0$, $p \in (0, 2]$ and
$\alpha > 1/p - 1/2$. Assume $c$ large enough, $n^\alpha > R$ and $nR > 1$. Then there exists $C > 0$ that
only depends on $\alpha$, $p$ and $c$ such that

$$\sup_{s \in \mathcal{P}(R, \alpha, p)} E_s \|s - \tilde{s}\|^2 \le C (nR^2)^{1/(2\alpha + 1)}.$$

Corollary 4. Assume the assumptions of Theorem 5 are fulfilled. Then there exists
$C > 0$ that only depends on $\alpha$, $p$ and $c$ such that

$$\sup_{s \in \mathcal{P}(R, \alpha, p)} E_s \|s - \tilde{s}\|^2 \le C \inf_{\hat{s}} \sup_{s \in \mathcal{P}(R, \alpha, p)} E_s \|s - \hat{s}\|^2.$$


6. Simulation study 

In this section, numerical simulations are reported to compare the performances of the 
strategies EH, NEH and EI. 



6.1. The simulation experiment 

First, we fix $n = 2^{10} = 1024$, $r = 2$ and we consider six different matrices $s$ denoted by
$s^l$, $l = 1, \dots, 6$ and plotted in Figure 1. Precisely, only the points $(i, s_i^{(1)})$, $i = 1, \dots, n$,
corresponding to the first line of a given $s$ are plotted on this figure, since the second
line is given by $s_i^{(2)} = 1 - s_i^{(1)}$. The considered functions $s$ are either piecewise constant
(for $l = 1, 2$), or regular (for $l = 3, 4$), or piecewise linear (for $l = 5, 6$). Second, we fix
$n = 2^{10}$, $r = 4$ and we consider two matrices $s$ that are constructed from the six previous
ones. They are denoted, respectively, $s^7$ and $s^8$ and their four lines are plotted in Figure
2. For every considered $(n, r, s)$, we generate a sequence $(Y_1, \dots, Y_n)$ and we compute
the estimator $\tilde{s}$ with each of the three strategies. The obtained estimators are plotted
together with $s$ on Figure 3 for $s = s^2, s^3, s^5$ (here again, as $r = 2$, only the first lines of $s$
and $\tilde{s}$ are represented).

We estimate the risk $E_s[\|s - \tilde{s}\|^2]$ for each $s$ and each strategy using the following
iterative method. At step $h$, we generate a sequence $(Y_1^h, \dots, Y_n^h)$ according to $s$, we
compute the estimator $\tilde{s}^h$ from this sequence and compute the average of the values
$\|s - \tilde{s}^k\|^2$, $k = 1, \dots, h$; we increment $h$ and repeat this step until the difference between
two successive averages is lower than $10^{-2}$. Let $h^*$ be the number of performed steps.
Then the risk is estimated by the empirical risk

$$\frac{1}{h^*} \sum_{k=1}^{h^*} \|s - \tilde{s}^k\|^2.$$
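
The iterative risk estimation just described can be sketched as follows. This is a hedged illustration: the function names, the `max_steps` safeguard and the fixed random seed are additions of ours, and the estimator is passed in as an arbitrary callable. The toy estimator used in the example simply pools all observations (the model of dimension one); it stands in for any of the strategies studied here.

```python
import numpy as np

def sample_X(s, rng):
    """Draw one r x n 0/1 matrix with independent columns, column i ~ s[:, i]."""
    r, n = s.shape
    u = rng.random(n)
    # Inverse-CDF sampling: count how many cumulative sums lie below u.
    labels = (u[None, :] > np.cumsum(s, axis=0)).sum(axis=0)
    labels = np.minimum(labels, r - 1)  # guard against rounding in the last sum
    X = np.zeros((r, n))
    X[labels, np.arange(n)] = 1.0
    return X

def empirical_risk(s, estimator, tol=1e-2, max_steps=100_000, seed=0):
    """Average the squared loss until two successive averages differ by < tol."""
    rng = np.random.default_rng(seed)
    total, prev_avg = 0.0, None
    for h in range(1, max_steps + 1):
        X = sample_X(s, rng)
        total += ((s - estimator(X)) ** 2).sum()
        avg = total / h
        if prev_avg is not None and abs(avg - prev_avg) < tol:
            return avg, h
        prev_avg = avg
    return avg, max_steps

# Toy example: r = 2, n = 64, constant probabilities (0.3, 0.7).
s = np.tile(np.array([[0.3], [0.7]]), (1, 64))
pooled = lambda X: np.repeat(X.mean(axis=1, keepdims=True), X.shape[1], axis=1)
risk, steps = empirical_risk(s, pooled)
```

Note that this stopping rule can end early by chance when one loss value happens to fall close to the running average, so the returned value should be read as a rough estimate.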

The considered strategies are based on a penalty function that involves either two
constants $c_1, c_2$ or a constant $c$, see (9) and (14). These constants have to be properly
chosen so that the method performs well. So in order to compare the strategies in an
objective way, we vary the constant(s) on a grid, estimate the risk of the estimator in each
case and retain the optimal value $(c_1^*, c_2^*)$ or $c^*$ that leads to the smallest risk, which we
call the minimal risk. Precisely, we vary $c_1$ from 0 to 1 by steps of 0.1, $c_2$ from 0 to 6 by
steps of 0.1 and $c$ from 0 to 4 by steps of 0.1. Table 1 gives the estimated minimal risk for
each strategy and each considered $s$. Moreover, in order to study the influence of the
constant(s), we plotted the risks obtained with different values of $(c_1, c_2)$ or $c$. As an
example, the graphs obtained for the estimation of $s^1$ using the EH and NEH strategies
are presented on Figure 4.

To conclude the calibration study, let us notice that we can (at least heuristically) 
transpose to the NEH strategy the calibration method proposed in [6], since the NEH 
strategy calls in only one constant c. By doing so, our aim is to provide a data-driven 
calibration method taking into account the fact that the optimal choice for c depends 
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Figure 3. Points ) (fine point) for i = l,...,n and their estimators (thick point) for 

s = s 2 , s 3 , s 5 obtained, respectively, with the EH (top), NEH (middle) and EI (bottom) strategies. 



Table 1. Minimal empirical risks for each strategy

          EH       NEH      EI
  s^1    14.48    10.50     5.34
  s^2    23.35    16.48    12.98
  s^3     7.52     5.52     8.03
  s^4    32.81    17.16    35.07
  s^5    33.19    28.10    27.46
  s^6     9.04     5.82    10.89
  s^7    27.84    21.17    29.84
  s^8    24.98    19.91    23.26

Table 2. Empirical risks obtained with the data-driven calibration and minimal
empirical risks Risk_{c*} for the NEH strategy

         mean of c   std. dev.   Risk (data-driven)   c*     Risk_{c*}
  s^1      1.06        0.12           11.01           1.1     10.50
  s^2      1.06        0.10           16.52           1.1     16.48
  s^3      0.90        0.10            5.42           1.1      5.52
  s^4      0.74        0.10           16.99           0.8     17.16
  s^5      0.94        0.16           29.22           1.1     28.10
  s^6      0.82        0.10            6.71           1       5.82
  s^7      0.76        0.06           20.83           1.8     21.17
  s^8      0.70        0.05           20.17           1.6     19.91

on the unknown function $s$. We use here the notation of Section 5. Given a value of the
penalty constant $c$, the selected $J$, denoted here by $J_c$, is computed by running the NEH
algorithm with $0 \le J \le J_{\max}$, where $J_{\max}$ is fixed here to 7. We repeat the procedure for
different values of $c$ increasing from 0, until $J_c$ equals 0 (see [14] for similar computations).
Then we consider the particular value of $c$, denoted $\hat c$, for which the difference between
the dimensions of two successive selected models is the biggest. Finally, the retained
penalty constant is $2\hat c$. Table 2 gives the mean and the standard deviation of the retained
value for $c$, denoted by $\bar c$ and $\sigma$, respectively, together with the estimated risk. For ease of
comparison, we also report in this table the minimal risk and its corresponding optimal
constant $c^*$ defined above.
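The calibration rule just described (locate the largest drop in selected dimension, then double the constant) can be sketched as below. This is an illustration, not the authors' code: `select_dim` is a hypothetical callable returning the dimension of the model selected for a given constant, and taking $\hat c$ as the grid value reached right after the largest drop is an assumption, since the text does not say which endpoint of the jump is used.

```python
def dimension_jump(select_dim, c_grid):
    """Return (2 * c_hat, dims), where c_hat is the grid value at which the
    selected dimension drops the most between two successive values of c."""
    dims = [select_dim(c) for c in c_grid]
    # Drop in dimension between successive grid values (c_grid is increasing).
    jumps = [dims[i - 1] - dims[i] for i in range(1, len(dims))]
    i_max = max(range(len(jumps)), key=jumps.__getitem__)
    c_hat = c_grid[i_max + 1]  # assumption: value right after the largest drop
    return 2 * c_hat, dims
```

In practice `c_grid` would start at 0 and stop once the selected model has reached its smallest possible dimension.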
6.2. Comments

In terms of risk, the NEH strategy outperforms the EH strategy for every considered 
s, see Table 1. This is consistent with our theoretical results since the NEH strategy 
satisfies oracle and minimax results where the EH strategy satisfies these results only up 
to a log factor. The EI strategy outperforms the NEH strategy, as expected, when s is 
piecewise constant, whereas the NEH strategy outperforms the EI strategy for all the 
other considered s. 

In terms of computational issues, both strategies based on Haar functions outperform
the EI strategy. Indeed, with these strategies the computational effort is only of order
$O(n \log n)$, see Sections 4 and 5. On the contrary, even using dynamic programming, the
computational complexity of the EI strategy is of order $O(n^3)$.

The last issue concerns the choice of the penalty constants. First, we observed in the
simulation results that, for every considered $s^l$ and whatever the strategy, small departures
from the optimal values of the penalty constants do not damage the risk of the estimator
(see Figure 4 for the estimation of $s^1$ with the EH and NEH strategies). Now, a great
advantage of the NEH strategy in comparison with both exhaustive strategies comes from
the fact that it involves only one constant. One can then use the data-driven calibration
method described in Section 6.1, and we observe that this method works successfully since
the obtained risks are very close to the minimal risks (see Table 2).

7. Application to change point detection 

In this section, we provide an application of our method to the change point detection 
problem. Precisely, we adapt our method to this problem by providing an algorithm that 
combines the NEH and the EI strategies, then we run this algorithm on a DNA sequence. 

7.1. A hybrid algorithm

Our aim is to build a piecewise constant estimator of $s$ in such a way that the change
points in the estimator reflect change points in the underlying distribution $s$, when the
sample is large. For this task, we first estimate $s$ using the NEH strategy and the penalty
function

$$\mathrm{pen}(m) = 2\hat c D_m,$$

where $\hat c$ is obtained with the data-driven calibration method described in Section 6.1.
The obtained estimator $\hat s$ is piecewise constant and properly estimates $s$ but, due to
its construction, it possesses many jumps and not all of them have to be interpreted
as change points in $s$. Thus, we use the EI strategy to decide which change points in
$\hat s$ should be interpreted as change points in $s$. To be more precise, let us denote by $\mathcal J$
the set of points $i \in \{2,\dots,n\}$ such that $\hat s_i \ne \hat s_{i-1}$. We perform the EI strategy on the






Table 3. Minimal empirical risks for the hybrid and the EI strategies

         NEH-EI_1   NEH-EI_2    EI
  s^1      5.34       5.06      5.34
  s^2     14.5       15.48     12.98



collection of all the partitions of $\{1,\dots,n\}$ built on $\mathcal J$. Here, the penalty function takes
the form

$$\mathrm{pen}_1(m) = D_m\left(c_1 \log\left(\frac{|\mathcal J|}{D_m}\right) + c_2\right), \qquad (15)$$

where $c_1$ and $c_2$ have to be chosen and where $|\mathcal J|$ denotes the cardinality of $\mathcal J$. This
penalty involves two constants, which may pose difficulties in practical applications, so we
slightly modify the EI strategy by considering a simpler form of penalty function. It is
easy to see from Section 3 that a penalty of the form $cD_m$ is allowed in connection with
the collection of models considered there, provided $c > k\log(n)$ with some large enough
$k$. Thus, instead of (15), we consider the penalty function

$$\mathrm{pen}_2(m) = cD_m. \qquad (16)$$

To summarize, our hybrid algorithm consists of two steps: the NEH strategy first provides
a piecewise constant estimator with many jumps, then the EI strategy is applied.
This is of interest when the sample is large, since the EI strategy (like the segmentation
methods described in Section 1) has a computational complexity that is too high to be
implemented on a large sequence of length $n$, but can typically be implemented if the set
of possible change points is restricted to $\mathcal J$. Such an algorithm is similar in spirit to the
algorithms in Gey and Lebarbier [11] and Akakpo [2]. In [11], the first step estimator is
built using the CART (Classification and Regression Tree) algorithm but suffers from a
higher computational complexity. In [2], it is inspired by our own method and relies on
dyadic intervals.
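The second step of the hybrid algorithm can be sketched with a small dynamic program restricted to the candidate jumps. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, positions are 0-based, the data `x` is the $r \times n$ one-hot matrix, and the penalty is taken as $c$ times the number of segments, in the spirit of (16).

```python
import numpy as np

def first_step_jumps(s_hat):
    """0-based positions i >= 1 where column i of s_hat differs from column i-1."""
    changed = np.any(s_hat[:, 1:] != s_hat[:, :-1], axis=0)
    return (np.flatnonzero(changed) + 1).tolist()

def prune_jumps(x, candidates, c):
    """Least-squares segmentation of x with change points restricted to
    `candidates` and penalty c * (number of segments)."""
    r, n = x.shape
    bounds = [0] + sorted(candidates) + [n]
    cum = np.concatenate([np.zeros((r, 1)), np.cumsum(x, axis=1)], axis=1)

    def cost(a, b):
        # Residual sum of squares on [a, b); each one-hot column has norm 1.
        counts = cum[:, b] - cum[:, a]
        return (b - a) - np.sum(counts ** 2) / (b - a)

    m = len(bounds)
    best = [0.0] + [np.inf] * (m - 1)
    prev = [0] * m
    for j in range(1, m):
        for i in range(j):
            val = best[i] + cost(bounds[i], bounds[j]) + c
            if val < best[j]:
                best[j], prev[j] = val, i
    # Backtrack to recover the retained change points.
    kept, j = [], m - 1
    while j > 0:
        j = prev[j]
        if j > 0:
            kept.append(bounds[j])
    return sorted(kept)
```

The double loop costs a number of operations quadratic in the number of candidates rather than in $n$, which is why restricting the search to the first-step jumps makes it usable on long sequences.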

We report below a small simulation study to justify the use of (16). We denote by
NEH-EI$_1$ and NEH-EI$_2$ the hybrid strategies with a penalty of the form (15) and (16),
respectively, at the EI step. We vary $c_1$, $c_2$ and $c$ on the same grids as in Section 6.1 and
compute the minimal empirical risks obtained for both hybrid strategies NEH-EI$_1$ and
NEH-EI$_2$, when $s$ is either $s^1$ or $s^2$. The results are given in Table 3, where we also
report the minimal empirical risks obtained in Section 6.1 using the EI strategy (recall that
among the strategies studied there, EI provides the smallest risk on $s^1$ and $s^2$). The risks
obtained with the three strategies are comparable. This justifies the use of the hybrid
strategies: they can be implemented even on large samples (whereas EI cannot) and
compare, in terms of risk, to EI on the considered examples.

7.2. Application on a DNA sequence

In this section, we run the NEH-EI$_2$ strategy on a DNA sequence and compare our
estimated change points with the annotation contained in the GenBank database. Precisely,
we consider the Bacillus subtilis sequence reduced to $n = 2^{21} = 2{,}097{,}152$ bases in length.
We consider this DNA sequence as a sequence of independent multinomial variables with
values in the alphabet {A,C,G,T} of length $r = 4$, and we run NEH-EI$_2$ with penalty
constants calibrated with the data-driven method of Section 6.1. This provides us with an
estimator of $s$. We plot the four lines of this estimator together with the GenBank
annotation using the software MuGeN [12]. For the sake of readability, Figure 5 shows the
results for the first 178,000 bases only. The arrows correspond to genes and their direction
to the transcription direction. All the regions composed of genes coding for ribosomal
RNA (dark genes in Figure 5) are detected by NEH-EI$_2$. Moreover, NEH-EI$_2$ detects
some changes of transcription direction. These results are close to those obtained in [13]
on the first 200,000 bases of the sequence. However, our strategy is much faster and allows
the study of a much greater part of the sequence.

8. Proof of the theorems 

Throughout the section, $C$ denotes a constant that may change from line to line. The
cardinality of a set $E$ is denoted by $|E|$. Moreover, we set for every $x > 0$,

$$\lfloor x \rfloor = \sup\{l \in \mathbb N,\ l \le x\} \quad\text{and}\quad \lceil x \rceil = \inf\{l \in \mathbb N,\ l \ge x\}.$$
8.1. Proof of Theorem 1 

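In order to prove Theorem 1, we follow the line of the proof of Theorem 2 in [5]. Fix
$m \in \mathcal M$. By definition of $\hat s$ and $\bar s_m$ we have

$$\gamma_n(\hat s) + \mathrm{pen}(\hat m) \le \gamma_n(\bar s_m) + \mathrm{pen}(m).$$

But

$$\gamma_n(t) = -2\langle \varepsilon, t\rangle + \|s - t\|^2 - \|s\|^2$$

for all $r \times n$ matrices $t$. Therefore

$$\|s - \hat s\|^2 \le \|s - \bar s_m\|^2 + 2\langle \varepsilon, \hat s - \bar s_m\rangle + \mathrm{pen}(m) - \mathrm{pen}(\hat m). \qquad (17)$$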
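For readers who want the intermediate step, (17) follows by substituting the expansion of $\gamma_n$ into the inequality that defines the penalized estimator. The sketch below writes $\hat m$ for the selected model and $\bar s_m$ for the orthogonal projection of $s$; this notation is assumed here for readability.

```latex
\begin{align*}
\gamma_n(\hat s)-\gamma_n(\bar s_m)
  &= -2\langle\varepsilon,\hat s-\bar s_m\rangle
     + \|s-\hat s\|^2-\|s-\bar s_m\|^2
   \;\le\; \mathrm{pen}(m)-\mathrm{pen}(\hat m),\\
\text{hence}\quad \|s-\hat s\|^2
  &\le \|s-\bar s_m\|^2
     + 2\langle\varepsilon,\hat s-\bar s_m\rangle
     + \mathrm{pen}(m)-\mathrm{pen}(\hat m).
\end{align*}
```

The term $\|s\|^2$ cancels in the difference, which is why it does not appear in (17).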

We need a concentration inequality to control the random term $\langle \varepsilon, \hat s - \bar s_m\rangle$. More precisely,
we will control $\langle \varepsilon, \hat s_{m'} - \bar s_m\rangle$ uniformly in $m' \in \mathcal M$ by using the following lemma (see
Appendix A.1 for a proof).

Figure 5. Annotation of the first 178,000 bases of Bacillus subtilis, and estimator obtained
with NEH-EI$_2$.

Lemma 1. For every $m' \in \mathcal M$ and $x > 0$, we have

$$\mathbb P_s\left(\sup_{u \in \mathbb R^r \otimes (S_m + S_{m'})} \frac{\langle \varepsilon, u\rangle}{\|u\|} \ge \sqrt{D_m + D_{m'}} + 2\sqrt{2x}\right) \le \exp(-x).$$



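Fix $\xi > 0$ and let $\Omega_\xi(m)$ denote the event

$$\Omega_\xi(m) = \bigcap_{m' \in \mathcal M}\left\{\sup_{u \in \mathbb R^r \otimes (S_m + S_{m'})} \frac{\langle \varepsilon, u\rangle}{\|u\|} \le \sqrt{D_m + D_{m'}} + 2\sqrt{2(L_{m'}D_{m'} + \xi)}\right\}.$$

It follows from the definition of $\Sigma$ that

$$\mathbb P_s(\Omega_\xi(m)) \ge 1 - \Sigma\exp(-\xi). \qquad (18)$$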
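Moreover, (17) proves that on $\Omega_\xi(m)$,

$$\|s - \hat s\|^2 \le \|s - \bar s_m\|^2 + 2\left(\sqrt{D_m + D_{\hat m}} + 2\sqrt{2(L_{\hat m}D_{\hat m} + \xi)}\right)\|\hat s - \bar s_m\| + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$$

Note that $\|\hat s - \bar s_m\| \le \|\hat s - s\| + \|s - \bar s_m\|$ and that for every positive numbers $a$, $b$ and $\theta$
one has $2\sqrt{ab} \le \theta a + \theta^{-1}b$. Applying this inequality twice, we get that on $\Omega_\xi(m)$ and for
any $\eta \in (0,1)$,

$$\|s - \hat s\|^2 \le \|s - \bar s_m\|^2 + (1-\eta)\left((1+\eta)\|s - \hat s\|^2 + (1+\eta^{-1})\|s - \bar s_m\|^2\right)$$
$$\qquad + \frac{1}{1-\eta}\left(\sqrt{D_m + D_{\hat m}} + 2\sqrt{2(L_{\hat m}D_{\hat m} + \xi)}\right)^2 + \mathrm{pen}(m) - \mathrm{pen}(\hat m).$$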

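But for all positive numbers $a$ and $b$, one has $\sqrt{a+b} \le \sqrt a + \sqrt b$, so for any $\eta \in (0,1)$ we
get

$$\left(\sqrt{D_m + D_{\hat m}} + 2\sqrt{2(L_{\hat m}D_{\hat m} + \xi)}\right)^2 \le \left(\sqrt{D_m} + \sqrt{D_{\hat m}}\left(1 + 2\sqrt{2L_{\hat m}}\right) + 2\sqrt{2\xi}\right)^2$$
$$\qquad \le 2(1+\eta^{-1})(D_m + 8\xi) + (1+\eta)D_{\hat m}\left(1 + 2\sqrt{2L_{\hat m}}\right)^2.$$

In particular, for $\eta$ such that

$$K = \frac{1+\eta}{1-\eta},$$

that is, $\eta = (K-1)/(K+1)$, it follows from the assumption on the penalty function pen
that on $\Omega_\xi(m)$,

$$\left(\frac{K-1}{K+1}\right)^2\|s - \hat s\|^2 \le \frac{K^2 + 4K - 1}{K^2 - 1}\|s - \bar s_m\|^2 + \frac{2K(K+1)}{K-1}(D_m + 8\xi) + \mathrm{pen}(m).$$

But $K^2 + 4K - 1 \le 2K(K+1)$ and $D_m \le \mathrm{pen}(m)/K$, so on $\Omega_\xi(m)$,

$$\|s - \hat s\|^2 \le \frac{2K(K+1)^2}{(K-1)^3}\left\{\|s - \bar s_m\|^2 + \mathrm{pen}(m)\right\} + 16\frac{K(K+1)^3}{(K-1)^3}\,\xi.$$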

It then follows from the definition of $\Omega_\xi(m)$ and (18) that there exists a random variable
$V \ge 0$ such that on the one hand,

$$\|s - \hat s\|^2 \le \frac{2K(K+1)^2}{(K-1)^3}\left\{\|s - \bar s_m\|^2 + \mathrm{pen}(m)\right\} + 16\frac{K(K+1)^3}{(K-1)^3}\,V,$$

and on the other hand,

$$\mathbb P_s(V > \xi) \le \Sigma\exp(-\xi) \qquad \forall \xi > 0.$$

Integrating the latter inequality yields $\mathbb E_s(V) \le \Sigma$, which completes the proof of the
theorem since $m$ is arbitrary.

8.2. Proof of Theorems 2 and 3 

We need more notation. Let $P_s$ be the distribution of $X$ under $\mathbb P_s$. Let $u$ be the $r \times n$
matrix such that $u^{(j)}$ is the vector of $\mathbb R^n$ with all components equal to 1/2 for $j \in \{1,2\}$,
and, in the case where $r > 2$, $u^{(j)}$ is the null vector of $\mathbb R^n$ for all $j > 2$. If there exists $N \in \mathbb N$
such that $n = 2^N$, we denote by $\{\phi_\lambda, \lambda = 1,\dots,n\}$ the Haar basis given in Definition 1
with indexes arranged in lexical order: an element $\lambda \in \Lambda(j)$ is written $\lambda = 1$ if $j = -1$
and $\lambda = 1 + l$ for some $l \in \{2^j,\dots,2^{j+1}-1\}$ if $j = 0,\dots,N-1$. In that case, for every
$D_0 \in \{1,\dots,n-1\}$, $D \in \{1,\dots,D_0\}$ and $A > 0$, we denote by $\mathcal P(A,D_0,D)$ the set of those
$r \times n$ matrices $s$ such that each column of $s$ is a probability on $\{1,\dots,r\}$ and $s = u + t$
where, for all $j$,

$$t^{(j)} = \sum_{\lambda \in m}\beta_\lambda^{(j)}\phi_\lambda,$$

$m$ is a subset of $\{2,\dots,1+D_0\}$ with cardinality $D$ and $\|\beta_\lambda\|_r \le A$ for all $\lambda \in m$. We will
see later that when $n = 2^N$ for some $N \in \mathbb N$,

$$\mathcal P(A,D_0,D) \subset \mathcal P,$$

where $\mathcal P$ denotes either $\mathcal P_k$ or $\mathcal P(R,\alpha,p)$, $A$ is a well-chosen positive number and $D_0 \ge D$
are well-chosen integers. It follows that when $n = 2^N$,

$$\sup_{s \in \mathcal P}\mathbb E_s\|s - \hat s\|^2 \ge \sup_{s \in \mathcal P(A,D_0,D)}\mathbb E_s\|s - \hat s\|^2.$$

In order to get a lower bound on $\mathcal P$, we thus first compute a lower bound on $\mathcal P(A,D_0,D)$
for arbitrary $A$, $D_0$ and $D$ (see Appendix A.2 for a proof).

Lemma 2. Assume $n = 2^N$. There exists a universal constant $C > 0$ such that

$$\inf_{\hat s}\sup_{s \in \mathcal P(A,D_0,D)}\mathbb E_s\|s - \hat s\|^2 \ge CD\left(A^2 \wedge \left(\log\left(\frac{D_0}{D}\right) + 1\right) \wedge \frac{n}{D_0}\right) \qquad (19)$$

for every $A$, $D$ and $D_0$ with $A > 0$ and $1 \le D \le D_0 < n$.
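To make the lexical indexing concrete, the sketch below builds a discrete Haar basis of $\mathbb R^n$ for $n = 2^N$, with the constant vector first and then the $2^j$ functions of each level $j = 0,\dots,N-1$. Definition 1 is not reproduced in this section, so the normalization used here (orthonormal vectors with sup norm $2^{j/2}/\sqrt n$ at level $j$) is an assumption chosen to match the properties used in Appendix A.2.

```python
import numpy as np

def haar_basis(N):
    """Rows are the basis vectors: row 0 is lambda = 1 (level j = -1) and
    row l is lambda = 1 + l for l in {2^j, ..., 2^(j+1) - 1} at level j."""
    n = 2 ** N
    rows = [np.full(n, 1.0 / np.sqrt(n))]
    for j in range(N):
        width = n // 2 ** j               # support length at level j
        for k in range(2 ** j):
            phi = np.zeros(n)
            start = k * width
            phi[start:start + width // 2] = 1.0
            phi[start + width // 2:start + width] = -1.0
            rows.append(phi * 2 ** (j / 2) / np.sqrt(n))
    return np.array(rows)
```

With this ordering, a subset $m \subset \{2,\dots,1+D_0\}$ selects rows $1,\dots,D_0$ of the returned array, that is, the coarsest non-constant functions.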

We will also use the following two lemmas. 

Lemma 3. Let $u$ and $v$ be $r \times n$ matrices the columns of which are probabilities on
$\{1,\dots,r\}$. Then,

$$\inf_{\hat s}\sup_{s \in \{u,v\}}\mathbb E_s\|s - \hat s\|^2 \ge \frac{\|u - v\|^2}{4}\left(1 - \sqrt{\frac{K(P_u,P_v)}{2}}\right),$$

where $K(\cdot,\cdot)$ denotes Kullback-Leibler divergence.

Proof. For every estimator $\hat s$,

$$\sup_{s \in \{u,v\}}\mathbb E_s\|s - \hat s\|^2 \ge \frac12\int\left(\|u - \hat s\|^2 + \|v - \hat s\|^2\right)\min(dP_u, dP_v).$$

By convexity of the function $x \mapsto x^2$ combined with the triangle inequality,

$$\|u - \hat s\|^2 + \|v - \hat s\|^2 \ge \|u - v\|^2/2.$$

Moreover, Pinsker's inequality yields

$$\int\min(dP_u, dP_v) \ge 1 - \sqrt{\frac{K(P_u,P_v)}{2}},$$

so we get the result. □
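The Pinsker step used in this proof can be checked numerically on discrete distributions. The snippet below is a sanity check rather than part of the argument; the helper names `kl` and `affinity` are introduced here for illustration.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence K(p, q) for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def affinity(p, q):
    """Discrete analogue of the integral of min(dP, dQ)."""
    return float(np.sum(np.minimum(p, q)))

def pinsker_holds(p, q, tol=1e-12):
    # The affinity equals 1 minus the total variation distance.
    return affinity(p, q) >= 1.0 - np.sqrt(kl(p, q) / 2.0) - tol
```

The check relies on the identity between the affinity and one minus the total variation distance, which is the form in which Pinsker's inequality enters the proof.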

Lemma 4. Let $u$ and $v$ be $r \times n$ matrices such that each column is a probability on
$\{1,\dots,r\}$ and each component either equals zero or belongs to $[1/4,1]$. If $P_u \ll P_v$ and
$P_v \ll P_u$, then

$$K(P_u,P_v) \le 4\|u - v\|^2,$$

where $K(\cdot,\cdot)$ denotes Kullback-Leibler divergence.

Proof. The result is obtained by bounding the Kullback-Leibler divergence by the $\chi^2$
divergence:

$$K(P_u,P_v) \le \sum_{i=1}^{n}\sum_{j=1}^{r}\frac{(u_i^{(j)} - v_i^{(j)})^2}{v_i^{(j)}}\mathbf 1_{\{v_i^{(j)} > 0\}} \le 4\|u - v\|^2. \qquad □$$
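The chain of bounds behind Lemma 4 can be checked numerically for a single column; for product measures the divergences add up over the columns. This is an illustrative sanity check with hypothetical helper names, assuming both vectors have the same support and non-zero components in $[1/4,1]$.

```python
import numpy as np

def kl(u, v):
    """Kullback-Leibler divergence between two discrete distributions."""
    mask = u > 0
    return float(np.sum(u[mask] * np.log(u[mask] / v[mask])))

def chi2(u, v):
    """Chi-square divergence, summed over the support of v."""
    mask = v > 0
    return float(np.sum((u[mask] - v[mask]) ** 2 / v[mask]))

def lemma4_chain_holds(u, v, tol=1e-12):
    """Check K(u, v) <= chi2(u, v) <= 4 * ||u - v||^2."""
    sq = float(np.sum((u - v) ** 2))
    return kl(u, v) <= chi2(u, v) + tol and chi2(u, v) <= 4.0 * sq + tol
```

The second inequality is where the lower bound 1/4 on the non-zero components is used.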

Proof of Theorem 2. We distinguish between three cases. First, we consider the case
where $k \ge 4$, since in that case we can apply Lemma 2. Then we assume $k = 3$. If $n$ is
bounded ($n \le 12$, say) then the log term in the lower bound can be viewed as a constant
and we will see that the result is an easy consequence of Lemma 3. In the last case ($k = 3$
and $n$ possibly large), we use another argument to get the log term in the lower bound.

Case 1. $n \ge k \ge 4$. Note first that it suffices to prove the result under the additional
assumption that there exists an integer $N$ such that $n = 2^N$. Indeed, assume there exists
an absolute $C > 0$ such that

$$\inf_{\hat s}\sup_{s \in \mathcal P_k}\mathbb E_s\|s - \hat s\|^2 \ge Ck\left(\log\left(\frac{n}{k}\right) + 1\right)$$

for all $k \in \{4,\dots,n\}$, provided $n$ is a power of 2. Then, in the general case, there exists
an integer $N \ge 2$ such that $2^N \le n < 2^{N+1}$ and it is easy to derive from the last display
that

$$\inf_{\hat s}\sup_{s \in \mathcal P_k}\mathbb E_s\|s - \hat s\|^2 \ge Ck\left(\log\left(\frac{2^N}{k}\right) + 1\right)$$

for every $k \in \{4,\dots,2^N\}$. But $2^N > n/2$ so (possibly reducing the constant $C$) we obtain
the result for every $k \in \{4,\dots,2^N\}$. Now, every $k$ with $2^N < k \le n$ satisfies $k \le 2k'$, where
$k' = 2^N$, and we have $\mathcal P_{k'} \subset \mathcal P_k$. Hence

$$\inf_{\hat s}\sup_{s \in \mathcal P_k}\mathbb E_s\|s - \hat s\|^2 \ge \inf_{\hat s}\sup_{s \in \mathcal P_{k'}}\mathbb E_s\|s - \hat s\|^2 \ge C\frac{k}{2}$$

and we get the required result for every $n \ge 4$ and $k \in \{4,\dots,n\}$. We thus assume in the
sequel that $n = 2^N$ and we fix $k \in \{4,\dots,n\}$. We then have $\mathcal P(A,D_0,D) \subset \mathcal P_k$, provided
$k \ge 4D$. Using Lemma 2 with $D = \lfloor k/4\rfloor$ and $A = \sqrt{n/D_0}$ thus yields

$$\inf_{\hat s}\sup_{s \in \mathcal P_k}\mathbb E_s\|s - \hat s\|^2 \ge C\lfloor k/4\rfloor\left(\left(\log\left(\frac{D_0}{\lfloor k/4\rfloor}\right) + 1\right) \wedge \frac{n}{D_0}\right)$$

for every $D_0 \ge \lfloor k/4\rfloor$ with $D_0 < n$. But $\lfloor k/4\rfloor \ge k/7$ and $\lfloor k/4\rfloor \le k/4$, so

$$\inf_{\hat s}\sup_{s \in \mathcal P_k}\mathbb E_s\|s - \hat s\|^2 \ge C\frac{k}{7}\left(\left(\log\left(\frac{4D_0}{k}\right) + 1\right) \wedge \frac{n}{D_0}\right). \qquad (20)$$

Let $D_0$ be an integer that makes the right-hand side of the previous inequality as large
as possible, that is, some integer that approximately equates the terms $\log(4D_0/k)$ and
$n/D_0$:

$$D_0 = \left\lfloor\frac{n}{\log(4n/k)}\right\rfloor.$$
We have — 2 log a; < c _1 for every x > 0, so 



k , [An\ 1 , . 

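which implies $D_0 \ge \lfloor k/4\rfloor$. Moreover, $D_0 < n$ since $k \le n$, so (20) holds for this value of
$D_0$. By (21) and the definition of $D_0$,

$$\frac{n}{2\log(4n/k)} \le D_0 \le \frac{n}{\log(4n/k)},$$

so (20) proves that there exists an absolute constant $C > 0$ such that

$$\inf_{\hat s}\sup_{s \in \mathcal P_k}\mathbb E_s\|s - \hat s\|^2 \ge Ck\left(\left(\log\left(\frac{2n}{k}\right) + 1 - \log\log\left(\frac{4n}{k}\right)\right) \wedge \log\left(\frac{4n}{k}\right)\right).$$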

Straightforward computations then yield the result.

Case 2. $k = 3$ and $n > 12$. Let

$$L = \left\lfloor\frac{n}{\lfloor\log n\rfloor}\right\rfloor$$

and for every $l = 0,\dots,L-1$, let $\varphi_l$ be that vector of $\mathbb R^n$ with the $i$th component $\varphi_{li} =
1/\sqrt{\lfloor\log n\rfloor}$ if $i \in \{1 + l\lfloor\log n\rfloor,\dots,(l+1)\lfloor\log n\rfloor\}$ and $\varphi_{li} = 0$ otherwise. For every $l =
0,\dots,L-1$, let $s_l$ be the $r \times n$ matrix defined by

$$s_l^{(1)} = 1/2 + a\varphi_l, \qquad s_l^{(2)} = 1/2 - a\varphi_l, \qquad s_l^{(j)} = 0 \text{ for } j > 2 \text{ (in the case } r > 2),$$

where 




The system $\{\varphi_l, l = 0,\dots,L-1\}$ is orthonormal so $\|s_l - s_{l'}\|^2 = 4a^2$ for every $l \ne l'$.
Moreover, $s_{l,i}^{(j)} \ge 1/4$ for every $i = 1,\dots,n$ and $j = 1,2$ so Lemma 4 yields

$$K(P_{s_l},P_{s_{l'}}) \le 4\|s_l - s_{l'}\|^2 \le 16a^2$$

for every $l \ne l'$. We have $L \ge 6$ and $16a^2/\log L \le 2/3$, since $n > 12$ and $L \ge \sqrt n$. By
Proposition 9 of [5] with $\mathcal C = \{s_l, l = 0,\dots,L-1\}$, $\eta = 2a$ and $H = 16a^2$,

infsupE s |js- s|| 2 > 3^ |tognj, 

where the infimum is taken over the estimators $\hat s$ such that each column $\hat s_i$ is a
probability on $\{1,\dots,r\}$. Since $\mathcal C \subset \mathcal P_3$, we get the result by extending the infimum over all
possible estimators.

Case 3. $k = 3$ and $n \le 12$. Let $v \in \mathbb R^r \otimes \mathbb R^n$ be defined by $v_1^{(1)} = 3/4$, $v_1^{(2)} = 1/4$ and
$v_i^{(j)} = u_i^{(j)}$ for every $(i,j) \ne (1,1),(1,2)$, where $u$ is defined at the beginning of Section
8.2. Then $\|u - v\|^2 = 1/8$ and Lemma 4 yields $K(P_u,P_v) \le 1/2$. It thus follows from
Lemma 3 that

$$\inf_{\hat s}\sup_{s \in \{u,v\}}\mathbb E_s\|s - \hat s\|^2 \ge \frac{1}{64}.$$

Both $u$ and $v$ belong to $\mathcal P_3$, so there exists an absolute constant $C > 0$ such that

$$\inf_{\hat s}\sup_{s \in \mathcal P_3}\mathbb E_s\|s - \hat s\|^2 \ge C.$$

The result follows since $n \le 12$ and $k = 3$. □

Proof of Theorem 3. For every $D \in \{1,\dots,n-1\}$, it is easy to see that

$$\mathcal P(\sqrt n RD^{-(\alpha+1/2)}, D, D) \subset \mathcal P(R,\alpha,p).$$

Therefore,

$$\sup_{s \in \mathcal P(R,\alpha,p)}\mathbb E_s\|s - \hat s\|^2 \ge \sup_{s \in \mathcal P(\sqrt n RD^{-(\alpha+1/2)},D,D)}\mathbb E_s\|s - \hat s\|^2$$

for every estimator $\hat s$ and $D \in \{1,\dots,n-1\}$. It then follows from Lemma 2 that there
exists an absolute constant $C > 0$ such that for every $D \in \{1,\dots,n-1\}$,

$$\inf_{\hat s}\sup_{s \in \mathcal P(R,\alpha,p)}\mathbb E_s\|s - \hat s\|^2 \ge CD\left((nR^2D^{-(2\alpha+1)}) \wedge 1\right).$$

We choose $D$ that approximately equates the term $nR^2D^{-(2\alpha+1)}$ with 1; we set

$$D = \lfloor(nR^2)^{1/(2\alpha+1)}\rfloor.$$

Then $D \ge 1$ since $nR^2 \ge 1$ and $D < n$ since $R < n^\alpha$, so we get

$$\inf_{\hat s}\sup_{s \in \mathcal P(R,\alpha,p)}\mathbb E_s\|s - \hat s\|^2 \ge C\lfloor(nR^2)^{1/(2\alpha+1)}\rfloor.$$

The result now follows from

$$\lfloor(nR^2)^{1/(2\alpha+1)}\rfloor \ge \tfrac12(nR^2)^{1/(2\alpha+1)}. \qquad □$$

8.3. Proof of Theorems 4 and 5 

In order to prove Theorems 4 and 5, we first control the approximation terms

$$\inf_{m \in \mathcal M,\ D_m = D}\|s - \bar s_m\|^2$$

for $s \in \mathcal P(R,\alpha,p)$, where $\bar s_m$ is the orthogonal projection of $s$ onto $\mathbb R^r \otimes S_m$. For this task,
we use the following lemma.

Lemma 5. Let $R > 0$, $p \in (0,2]$, $\alpha > 1/p - 1/2$ and $s \in \mathcal P(R,\alpha,p)$. Let $J \in \{0,\dots,N-1\}$
and let $\mathcal M_J$ be defined as in Section 5. There exists some $C > 0$ that only depends on $p$
and $\alpha$ such that

$$\inf_{m \in \mathcal M_J}\|s - \bar s_m\|^2 \le CnR^2 2^{-2\alpha J}.$$

Proof. In the sequel, we use the notation of Definition 8 of [5], page 237. Moreover, we
set

$$\beta_\lambda^{(j)} = \langle s^{(j)}, \phi_\lambda\rangle$$

for every $j \in \{1,\dots,r\}$ and $\lambda \in \Lambda$. By definitions of $\bar s_m$ and $\mathcal M_J$, we have

$$\inf_{m \in \mathcal M_J}\|s - \bar s_m\|^2 = \sum_{k=0}^{N-J-1}\ \sum_{\lambda \in \Lambda(J+k)(|\Lambda'(J+k)|)}\|\beta_\lambda\|_r^2.$$

Since $s \in \mathcal P(R,\alpha,p)$, we have

$$\sum_{\lambda \in \Lambda(J+k)}\|\beta_\lambda\|_r^p \le n^{p/2}R^p 2^{-p(J+k)(\alpha+1/2-1/p)}.$$

By Lemma 2 of [5], page 246 we thus have

$$\sum_{\lambda \in \Lambda(J+k)(|\Lambda'(J+k)|)}\|\beta_\lambda\|_r^2 \le (|\Lambda'(J+k)| + 1)^{1-2/p}\left(\sqrt n R 2^{-(J+k)(\alpha+1/2-1/p)}\right)^2.$$

But $1 - 2/p \le 0$ and $|\Lambda'(J+k)| + 1 \ge 2^J(k+1)^{-3}$ so

$$\inf_{m \in \mathcal M_J}\|s - \bar s_m\|^2 \le nR^2 2^{-2\alpha J}\sum_{k \ge 0}(k+1)^{-3(1-2/p)}2^{-2k(\alpha+1/2-1/p)},$$

which proves the lemma. □

Proof of Theorem 4. Let $s$ be an arbitrary element of $\mathcal P(R,\alpha,p)$. All the elements $m$
of a given $\mathcal M_J$ have the same cardinality and satisfy $|m| \le c_0 2^J$, with $c_0 = 1 + \sum_{k \ge 1}k^{-3}$.
Let $D \in \{1,\dots,n\}$ with $D \ge c_0$ and let

$$J = \sup\{j = 0,\dots,N-1 \text{ s.t. } c_0 2^j \le D\}.$$

We have $|m| \le c_0 2^J \le D$ for every $m \in \mathcal M_J$, so it follows from the definition of $\bar s_m$ that

$$\inf_{m \in \mathcal M,\ D_m = D}\|s - \bar s_m\|^2 \le \inf_{m \in \mathcal M_J}\|s - \bar s_m\|^2.$$

By the definition of $J$ we have $D < c_0 2^{J+1}$, so it follows from Lemma 5 that

$$\inf_{m \in \mathcal M,\ D_m = D}\|s - \bar s_m\|^2 \le (2c_0)^{2\alpha}CnR^2D^{-2\alpha}.$$

If $c_1$ and $c_2$ are large enough, then (2) holds with $L_m = 2 + \log(n/D_m)$ and $\Sigma \le 1$, see
(7) and (8). It thus follows from Theorem 1 and the definition of the penalty function
pen that there exists $C > 0$ that only depends on $\alpha$, $p$, $c_1$ and $c_2$ such that for every
$D = 1,\dots,n$ with $D \ge c_0$,

$$\mathbb E_s\|s - \hat s\|^2 \le C\left\{nR^2D^{-2\alpha} + D\left(1 + \log\left(\frac{n}{D}\right)\right)\right\}. \qquad (22)$$

Possibly enlarging $C$, one gets (22) for every $D = 1,\dots,n$. In order to optimize this
inequality, we search for the $D$ that provides the smallest upper bound, that is, some
$D$ that approximately equates the terms $nR^2D^{-2\alpha}$ and $D\log(n/D)$. For this task let us
define the function $f$ on $(1,n^{2\alpha+1})$ by

$$f(x) = x^{1/(2\alpha+1)}\left(\log(nx^{-1/(2\alpha+1)})\right)^{-1/(2\alpha+1)}$$

and let $D = \lceil f(nR^2)\rceil \ge 1$. The function $x \mapsto x\log x$ is increasing on $[e^{-1},\infty)$ and, by
assumption, n a > R^/p. Therefore 

n 2a n 2a 
- W lo g —>2a + l, 

which implies $D \le n$. Moreover, $f$ is increasing on $(1,n^{2\alpha+1})$ and, by assumption, $1 <
\log n \le nR^2 < n^{2\alpha+1}$. Therefore

$$f(nR^2) \ge f(\log n) \ge 1.$$

Since $D \le f(nR^2) + 1 \le 2f(nR^2)$ and $D \ge f(nR^2)$, the right-hand term of (22) is thus
bounded by

$$C\left\{nR^2(f(nR^2))^{-2\alpha} + 2f(nR^2)\left(1 + \log\left(\frac{n}{f(nR^2)}\right)\right)\right\}$$

and the result easily follows. □



Proof of Theorem 5. Let $s$ be an arbitrary element of $\mathcal P(R,\alpha,p)$ and recall that
$|m| \le c_0 2^J$ for every $m \in \mathcal M_J$, where $c_0 = 1 + \sum_{k \ge 1}k^{-3}$. If $c$ is large enough, then (2)
holds with the $L_m$'s all equal to an absolute constant $L$, and $\Sigma \le 1$ provided $L$ is large
enough, see Section 5. It thus follows from Theorem 1, Lemma 5 and the definition of
the penalty function pen that there exists $C > 0$ that only depends on $\alpha$, $p$ and $c$ such
that for every $J = 0,\dots,N-1$,

$$\mathbb E_s\|s - \hat s\|^2 \le C\{nR^2 2^{-2\alpha J} + 2^J\}.$$

In order to optimize this inequality, we search for the $J$ that provides the smallest upper
bound, that is, some $J$ that approximately equates the terms $nR^2 2^{-2\alpha J}$ and $2^J$: we
consider

$$J = \sup\{j \in \mathbb N \text{ s.t. } 2^j \le (nR^2)^{1/(2\alpha+1)}\}.$$

Then $J$ is well defined, since $nR^2 \ge 1$. Moreover, $J \le N-1$, since $n > (nR^2)^{1/(2\alpha+1)}$
(recall $n^\alpha > R$). The result easily follows, since

$$2^J \le (nR^2)^{1/(2\alpha+1)} \quad\text{and}\quad 2^{-(J+1)} < (nR^2)^{-1/(2\alpha+1)}. \qquad □$$
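The choice of $J$ in this proof is easy to compute explicitly. The sketch below (with the hypothetical name `balanced_level`) returns the level $J$ defined above together with the value of $nR^2 2^{-2\alpha J} + 2^J$; it assumes $nR^2 \ge 1$ so that $J$ is well defined.

```python
def balanced_level(n, R, alpha):
    """Largest J with 2^J <= (n R^2)^(1/(2 alpha + 1)), and the value of
    n R^2 2^(-2 alpha J) + 2^J at that level (assumes n * R^2 >= 1)."""
    target = (n * R ** 2) ** (1.0 / (2 * alpha + 1))
    J = 0
    while 2 ** (J + 1) <= target:
        J += 1
    bound = n * R ** 2 * 2 ** (-2 * alpha * J) + 2 ** J
    return J, bound
```

For instance, with $n = 1024$, $R = 1$ and $\alpha = 1$ the target is $1024^{1/3} \approx 10.08$, so $J = 3$ and the two terms are 16 and 8.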
A.1. Proof of Lemma 1

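Let $S = S_m + S_{m'}$ for some $m' \in \mathcal M$ and let $\varepsilon'$ be an $r \times n$ random matrix that has the same
distribution as $\varepsilon$ and is independent of $\varepsilon$. Set

$$Z = \sup_{u \in \mathbb R^r \otimes S}\frac{\langle\varepsilon,u\rangle}{\|u\|} \qquad (A.1)$$

and for every $i = 1,\dots,n$ let us define $Z^{(i)}$ in the same way as $Z$, but with $\varepsilon_i$ replaced
by $\varepsilon'_i$:

$$Z^{(i)} = \sup_{u \in \mathbb R^r \otimes S}\frac{1}{\|u\|}\left(\sum_{k \ne i}\langle\varepsilon_k,u_k\rangle_r + \langle\varepsilon'_i,u_i\rangle_r\right),$$

where $\langle\cdot,\cdot\rangle_r$ denotes the Euclidean inner product in $\mathbb R^r$. By the Cauchy-Schwarz
inequality the supremum in (A.1) is achieved at an $\varepsilon$-measurable point $u^*$ (which is the
orthogonal projection of $\varepsilon$ onto $\mathbb R^r \otimes S$). Hence

$$Z - Z^{(i)} \le \frac{1}{\|u^*\|}\langle\varepsilon_i - \varepsilon'_i, u_i^*\rangle_r.$$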



It is assumed that e' has the same distribution as e and is independent of e. Since e' is 
centered and u* is e-measurable, we get 



e s [« £ , - 4<M 2 k] = E E < U) < U ] ( £ i )£ i 5 + e.(W } )) 
3=1 j'=l 



E« (J) ) 2 (^ ) -2x« s w+ s W)~2 e i 



*C7(0) ,*0') 0') 



3=1 



i=i 



where $j(i)$ is the index such that $X_i^{(j(i))} = 1$ and $X_i^{(j)} = 0$ for all $j \ne j(i)$. On the one
hand, distinguishing the cases $X_i^{(j)} = 0$ and $X_i^{(j)} = 1$, one has $X_i^{(j)} - 2X_i^{(j)}s_i^{(j)} + s_i^{(j)} \le 1$.

On the other hand, using the inequality $2\sqrt{ab} \le a + b$ for all positive numbers $a$ and $b$
and Jensen's inequality, we get


2E< 



j'=i 



<(< W)) f+ E < ( ^ (J) <Ew 



Wv 2 



3=1 



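Hence

$$\mathbb E_s\left[\sum_{i=1}^{n}(Z - Z^{(i)})^2\mathbf 1_{Z > Z^{(i)}}\,\Big|\,\varepsilon\right] \le 2.$$

By Corollary 3 of [7] we thus have

$$\mathbb P_s\left(\sup_{u \in \mathbb R^r \otimes S}\frac{\langle\varepsilon,u\rangle}{\|u\|} \ge \mathbb E_s\left[\sup_{u \in \mathbb R^r \otimes S}\frac{\langle\varepsilon,u\rangle}{\|u\|}\right] + 2\sqrt{2x}\right) \le \exp(-x)$$

for all $x > 0$. Let $D$ be the dimension of $S$ and $\{\phi_1,\dots,\phi_D\}$ be an orthonormal basis of
$S$. Recall that the above supremum is the norm of the orthogonal projection of $\varepsilon$ onto
$\mathbb R^r \otimes S$ so by Jensen's inequality

$$\mathbb E_s\left[\sup_{u \in \mathbb R^r \otimes S}\frac{\langle\varepsilon,u\rangle}{\|u\|}\right] = \mathbb E_s\left[\left(\sum_{j=1}^{r}\sum_{k=1}^{D}\left(\langle\phi_k,\varepsilon^{(j)}\rangle_n\right)^2\right)^{1/2}\right] \le \left(\sum_{j=1}^{r}\sum_{k=1}^{D}\mathbb E_s\left(\langle\phi_k,\varepsilon^{(j)}\rangle_n\right)^2\right)^{1/2}.$$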
But D < D m + !?,„', var s (£^"''') < for every i and j and ||^fe|| 2 = 1 for every k, so 

$$\mathbb E_s\left[\sup_{u \in \mathbb R^r \otimes S}\frac{\langle\varepsilon,u\rangle}{\|u\|}\right] \le \sqrt{D_m + D_{m'}}$$

and the result follows.



A.2. Proof of Lemma 2

Let $\mu$ be the counting measure on $\mathbb N$. We consider the distance $\delta$ on the set of all subsets
of $\{1,\dots,n\}$ given by

$$\delta(m,m') = \frac12\int|\mathbf 1_m(x) - \mathbf 1_{m'}(x)|\,d\mu(x).$$

We begin by proving the following lemma.

Lemma 6. Let $D$ and $D_0$ be positive integers with $D \le D_0$. Assume that either $D_0 \ge 18D$
or $D_0 < 18D$ and $D \ge 534$. Then there exists a set $M$ of subsets of $\{2,\dots,D_0+1\}$ such
that $|m| \le D$ for all $m \in M$, $\delta(m,m') \ge D/7$ for all $m \ne m' \in M$, $|M| \ge 6$ and

$$\log|M| \ge \frac{D}{150}\log\left(\frac{D_0}{D} \vee 18\right).$$

Proof. Assume first $D_0 \ge 18D$. The result is trivial for $D = 1$ (consider $M = \{\{j\}, j =
2,\dots,D_0\}$) so we assume in the sequel $D \ge 2$. By Lemma 4 of [5], there exists a set $M$
of subsets of $\{2,\dots,1+D_0\}$ such that $|m| = 2\lfloor D/2\rfloor$ for all $m \in M$, $\delta(m,m') \ge D/2$ for
all $m \ne m' \in M$ and



log|M|> L-D/2J log 



( Do 

\[D/2\ 



log 16+ 1 



Since [D/2\ > 1 and D Q > 18D, we thus have \M\ > 6. Moreover, log 8 
(log(A)/-D))/2 and LD/ 2 J > D/3, so 

log|M|>5logf^ 



1 < 



6 



A D 



This set $M$ thus satisfies all the required properties and the proof is complete in the case
$D_0 \ge 18D$. Assume now $D_0 < 18D$ and $D \ge 534$. Let $k = \lfloor D/6\rfloor$. We use Lemma 9 of [3]
with $\Omega = \{2,\dots,6k+1\}$, $C = 3k$ and $q = k$. We know from this lemma that there exists
a set $M$ of subsets of $\{2,\dots,6k+1\}$ such that $|m| = 3k$ for all $m \in M$, $\delta(m,m') \ge k$ for
all $m \ne m' \in M$ and



\M\ > | 



On the one hand, 

"3/c-l 



(6fe) 



J (6fc-2a) 



a=0 



'3fe-2 



I(6fe 



2a -T 



> [2 3fc (3fc)!] x ^-'(Sfc-l)!]. 



3fc-l/ 



On the other hand, (3fc)! < (3 fc fc!) 3 and (2fc)! > 2 2k - 1 k\{k - 1)!, so 

1 2 10fc 
' M| ^32 X ^- 

Also, $k^3 \le 10^4(2^8/3^5)^k$ and $32 \times 10^4 \le (4/3)^{k/2}$ since $k \ge 89$ (recall $D \ge 534$), so $|M| \ge
(4/3)^{k/2}$. It is then easy to check that $M$ satisfies all the required properties. □

We turn now to the proof of Lemma 2, distinguishing between two cases.

Case 1. Either $D_0 \ge 18D$ or $D_0 < 18D$ and $D \ge 534$. Let $A > 0$ and $D_0$, $D$ be integers
with $1 \le D \le D_0 < n$. Assume that either $D_0 \ge 18D$ or $D_0 < 18D$ and $D \ge 534$. For
every $m \subset \{2,\dots,D_0+1\}$ let $s_m$ be the $r \times n$ matrix defined by

$$s_m^{(1)} = 1/2 + a\sum_{\lambda \in m}\phi_\lambda, \qquad s_m^{(2)} = 1/2 - a\sum_{\lambda \in m}\phi_\lambda, \qquad s_m^{(j)} = 0 \text{ for } j > 2 \text{ (in the case } r > 2),$$



where



Let $\|\cdot\|_\infty$ denote the supremum norm in $\mathbb R^n$. By definition, $\|\phi_\lambda\|_\infty \le 2^{j/2}/\sqrt n$ for every
$\lambda \in \Lambda(j)$ and $j = 0,\dots,N-1$. Moreover, the functions $\phi_\lambda$, $\lambda \in \Lambda(j)$, have disjoint supports
so

$$\Big\|\sum_{\lambda \in \Lambda(j)}\phi_\lambda\Big\|_\infty \le \frac{2^{j/2}}{\sqrt n}$$

for every $j$. Let $J_0$ be that integer such that $2^{J_0} \le D_0 + 1 < 2^{J_0+1}$. Then,

$$\{2,\dots,D_0+1\} \subset \bigcup_{j=0}^{J_0}\Lambda(j)$$

(recall $|\Lambda(j)| = 2^j$ for $j \ne -1$) so for every $m \subset \{2,\dots,1+D_0\}$,

$$\Big\|\sum_{\lambda \in m}\phi_\lambda\Big\|_\infty \le \sum_{j=0}^{J_0}\frac{2^{j/2}}{\sqrt n} \le 5\sqrt{\frac{D_0}{n}}.$$

Therefore, $a\|\sum_{\lambda \in m}\phi_\lambda\|_\infty \le 1/4$ for all $m \subset \{2,\dots,D_0+1\}$. It follows that $s_m \in
\mathcal P(A,D_0,D)$ for every subset $m$ of $\{2,\dots,D_0+1\}$ with cardinality $|m| \le D$. We now
use Proposition 9 of [5], page 263 with $\mathcal C = \{s_m, m \in M\}$, where $M$ is as in Lemma 6. By
Lemma 4,

$$K(P_{s_m},P_{s_{m'}}) \le 4\|s_m - s_{m'}\|^2$$

for every $m \ne m' \in M$, and since the vectors $\phi_\lambda$ are orthonormal, we have

$$\|s_m - s_{m'}\|^2 = 4a^2\delta(m,m').$$

Therefore, $K(P_{s_m},P_{s_{m'}}) \le 16a^2D$ and $\|s_m - s_{m'}\|^2 \ge 4a^2D/7$ for every $m \ne m' \in M$. We
thus set $\eta^2 = a^2D/2$ and $H = 16a^2D$. By definition of $a$,



$$\log|M| \ge \frac{D}{150}\log\left(\frac{D_0}{D} \vee 18\right) \ge \frac{D}{150}\,3600a^2.$$

Therefore, $H/\log|M| \le 2/3$ and Proposition 9 of [5] yields

a 2 D 



infsupE s ||s-s|| > , 
s sec ^4 

where the infimum is taken over all estimators $\hat s$ such that each column $\hat s_i$ is a probability
on $\{1,\dots,r\}$. Since $\mathcal C \subset \mathcal P(A,D_0,D)$, we then get (19) by extending the infimum over all
possible estimators.

Case 2. $D_0 < 18D$ and $D \le 533$. Let $A > 0$ and $D_0$, $D$ be integers with $1 \le D \le D_0 < n$,
$D_0 < 18D$ and $D \le 533$. Let $v = s_m$ where $s_m$ is defined as in Case 1 with $m = \{2\}$ and

$$a = \frac{A}{\sqrt 2} \wedge \frac14 \wedge \frac{\sqrt n}{4}.$$

We have $v \in \mathcal P(A,D_0,D)$ and, similar to Case 1, $\|u - v\|^2 = 2a^2$ and $K(P_u,P_v) \le 8a^2$.
By Lemma 3,

$$\inf_{\hat s}\sup_{s \in \{u,v\}}\mathbb E_s\|s - \hat s\|^2 \ge \frac{a^2}{2}\left(1 - \sqrt{8a^2/2}\right) \ge \frac{a^2}{4}$$

and the result follows from the assumptions $D_0 < 18D$ and $D \le 533$.